This notebook explores how diet quality relates to cardiometabolic risk and care in recent NHANES data. Diet is often treated as a single “healthy vs. unhealthy” lever for preventing diabetes and hypertension, but in practice it is tangled up with age, income, behavior bundles (smoking, heavy drinking), and patterns of contact with the health system.

Our goal here is to build an integrated NHANES analytic dataset (demographics, exams, disease/health-care questionnaires, and 24-hour dietary recalls) and use visualization to examine when and for whom a “better” diet shows up as better outcomes. Concretely, we ask: 1. How does diet quality relate to hypertension and diabetes across age groups and levels of adverse health behaviors (Behavior Load Index)? 2. Among adults with diagnosed hypertension or diabetes, does higher diet quality line up with better control, and does this differ by income? 3. Beyond a single diet score, where do the highest-risk individuals sit in the joint space of diet score, total energy, and sodium intake?

The following code chunks construct the analysis dataset, define our composite indices, and generate the figures used to answer these questions.

0. Setup

In this notebook I work with NHANES 2021–2023 data to explore how diet quality and a multi‐factor behavior load index (BLI) relate to access to care, diagnosis, and control of hypertension and diabetes. I use tidyverse tools for data engineering, then build a series of heatmaps, ridge plots, pathway bar charts, and ternary scatterplots to summarize these patterns.

The chunk below loads all required packages and sets a consistent minimalist theme for all figures.

1. Load and merge NHANES component files

NHANES is distributed across many component files (demographics, diet recalls, questionnaires, exam and lab data). Here I:

  1. Define the paths to all relevant XPT files.
  2. Read each file that exists and has SEQN (person ID).
  3. Inspect how many rows per person each file contains, which tells me which tables are long (e.g., diet item level) versus person-level.
  4. Collapse long tables to a single row per person (e.g., counts of foods, average blood pressure).
  5. Join everything into a single person-level dataset df with exactly one row per SEQN.
# Named vector of NHANES XPT component files

paths <- c(
DEMO   = "Data 237 Final_Datasets/DEMO_L.xpt",                     # demographics (weights, age, sex, race)
DR1TOT = "Data 237 Final_Datasets/dietary/DR1TOT_L.xpt",           # day 1 total nutrient intake
DR2TOT = "Data 237 Final_Datasets/dietary/DR2TOT_L.xpt",           # day 2 total nutrient intake
DR1IFF = "Data 237 Final_Datasets/dietary/DR1IFF_L.xpt",           # day 1 individual foods file
DR2IFF = "Data 237 Final_Datasets/dietary/DR2IFF_L.xpt",           # day 2 individual foods file
DPQ    = "Data 237 Final_Datasets/health outcome/DPQ_L.xpt",       # depression questionnaire
HSQ    = "Data 237 Final_Datasets/Other/HSQ_L.xpt",                # health status
MCQ    = "Data 237 Final_Datasets/health outcome/MCQ_L.xpt",       # medical conditions
BPQ    = "Data 237 Final_Datasets/health outcome/BPQ_L.xpt",       # blood pressure questionnaire
DIQ    = "Data 237 Final_Datasets/health outcome/DIQ_L.xpt",       # diabetes questionnaire
HUQ    = "Data 237 Final_Datasets/Other/HUQ_L.xpt",                # health care utilization
INQ    = "Data 237 Final_Datasets/Other/INQ_L.xpt",                # income and program participation
SMQ    = "Data 237 Final_Datasets/Other/SMQ_L.xpt",                # smoking
ALQ    = "Data 237 Final_Datasets/Other/ALQ_L.xpt",                # alcohol use
PAQ    = "Data 237 Final_Datasets/Other/PAQ_L.xpt",                # physical activity
SLQ    = "Data 237 Final_Datasets/Other/SLQ_L.xpt",                # sleep
DBQ    = "Data 237 Final_Datasets/Other/DBQ_L.xpt",                # diet behavior and nutrition
RXQ    = "Data 237 Final_Datasets/Other/RXQ_RX_L.xpt",             # prescription medications

# Optional measured anchors (Exam/Lab) for later analyses:

BMX    = "Data 237 Final_Datasets/Other/BMX_L.xpt",                # body measures (BMI, etc.)
BPXO   = "Data 237 Final_Datasets/Other/BPXO_L.xpt",               # oscillometric blood pressure
GHB    = "Data 237 Final_Datasets/Other/GHB_L.xpt",                # glycohemoglobin
TCHOL  = "Data 237 Final_Datasets/Other/TCHOL_L.xpt"               # total cholesterol
)

# Read each component that (a) exists on disk and (b) has SEQN (person identifier)

raw <- list()
for (nm in names(paths)) {
p <- paths[[nm]]
if (!file.exists(p)) next                  # silently skip missing files
dat <- read_xpt(p)                         # read XPT with haven
if (!"SEQN" %in% names(dat)) next          # keep only person-indexed files
raw[[nm]] <- dat
}

Next, I generate a quick duplicate report that summarizes how many rows per SEQN each component has. This helps distinguish long-form tables (e.g., diet items, multiple BP readings) from already person-level tables.

# For each component table, count rows per SEQN to see if it is long vs. wide

dup_report <- map_df(
names(raw),
~{
x <- raw[[.x]] %>% count(SEQN, name = "n_rows")
tibble(
file              = .x,
n_persons         = n_distinct(x$SEQN),
max_rows_per_seqn = max(x$n_rows, na.rm = TRUE),
pct_multi         = mean(x$n_rows > 1) * 100  # % of persons with >1 row
)
}
)

# Print the report ordered by maximum rows per SEQN

print(dup_report[order(-dup_report$max_rows_per_seqn), ], n = Inf)
## # A tibble: 22 × 4
##    file   n_persons max_rows_per_seqn pct_multi
##    <chr>      <int>             <int>     <dbl>
##  1 DR1IFF      6751                49      99.9
##  2 DR2IFF      5879                40     100.0
##  3 DEMO       11933                 1       0  
##  4 DR1TOT      8860                 1       0  
##  5 DR2TOT      8860                 1       0  
##  6 DPQ         6337                 1       0  
##  7 HSQ         6615                 1       0  
##  8 MCQ        11744                 1       0  
##  9 BPQ         8501                 1       0  
## 10 DIQ        11744                 1       0  
## 11 HUQ        11933                 1       0  
## 12 INQ        11933                 1       0  
## 13 SMQ         9015                 1       0  
## 14 ALQ         6337                 1       0  
## 15 PAQ         8153                 1       0  
## 16 SLQ         8501                 1       0  
## 17 DBQ        11933                 1       0  
## 18 RXQ        11933                 1       0  
## 19 BMX         8860                 1       0  
## 20 BPXO        7801                 1       0  
## 21 GHB         7199                 1       0  
## 22 TCHOL       8068                 1       0

Some components (e.g., individual foods, prescription medications, oscillometric blood pressure) are long-format with multiple rows per person. In the next chunk, I collapse these to per-person summaries and store them in an agg list that will later be merged with the person-level tables.

# Helper: average any present columns among a set of candidate column names

avg_cols <- function(df, cols) {
present <- intersect(names(df), cols)
if (length(present) == 0) return(NULL)
rowMeans(df[present], na.rm = TRUE)
}

agg <- list()

# A) Food-level intakes (DR1IFF / DR2IFF): derive simple per-person features

# Here I count how many food items were reported on day 1 and day 2.

if ("DR1IFF" %in% names(raw)) {
agg$DR1IFF <- raw$DR1IFF %>%
group_by(SEQN) %>%
summarise(
n_foods_day1 = n(),    # number of reported food items day 1
.groups = "drop"
)
}
if ("DR2IFF" %in% names(raw)) {
agg$DR2IFF <- raw$DR2IFF %>%
group_by(SEQN) %>%
summarise(
n_foods_day2 = n(),    # number of reported food items day 2
.groups = "drop"
)
}

# B) Medication list (RXQ_RX): count prescriptions per person

# Prefer distinct drug IDs if available; otherwise just count rows.

if ("RXQ" %in% names(raw)) {
med_id_col <- intersect(names(raw$RXQ), c("RXDDRGID","RXDRSC1","RXDRSC2"))
agg$RXQ <- raw$RXQ %>%
group_by(SEQN) %>%
summarise(
n_rx  = if (length(med_id_col)) n_distinct(.data[[med_id_col[1]]], na.rm = TRUE) else n(),
any_rx = as.integer(n() > 0),  # indicator for taking any medication
.groups = "drop"
)
}

# C) Oscillometric blood pressure (BPXO): average multiple readings per person

if ("BPXO" %in% names(raw)) {
bpx <- raw$BPXO %>%
mutate(
# average within visit across available SBP/DBP measurements
sbp_mean_row = avg_cols(., c("BPXOSY1","BPXOSY2","BPXOSY3")),
dbp_mean_row = avg_cols(., c("BPXODI1","BPXODI2","BPXODI3"))
) %>%
group_by(SEQN) %>%
summarise(
SBP_MEAN = mean(sbp_mean_row, na.rm = TRUE),
DBP_MEAN = mean(dbp_mean_row, na.rm = TRUE),
.groups = "drop"
)
agg$BPXO <- bpx
}

Finally, I identify the person-level tables (those already at one row per SEQN), append the aggregated long-form summaries, and merge everything into a single wide dataset df. I start the join from DEMO (if available) so that survey weights and design variables are preserved up front.

# Person-level tables: anything in `raw` that we did not aggregate above

person_level_names <- setdiff(names(raw), names(agg))
person_level <- raw[person_level_names]

# Build a single person-level table by joining all person-level + aggregated long tables

to_join <- c(person_level, agg)

# Start the join from DEMO if present, so weights/strata come first

start_name <- if ("DEMO" %in% names(to_join)) "DEMO" else names(to_join)[1]
df <- to_join[[start_name]]

# Sequentially left-join each remaining component on SEQN

for (nm in setdiff(names(to_join), start_name)) {
df <- full_join(df, to_join[[nm]], by = "SEQN")
}

# Sanity check: we should now have exactly one row per SEQN

stopifnot(!any(duplicated(df$SEQN)))
cat(
"One row per SEQN:", nrow(df), "respondents;",
ncol(df), "columns.\n"
)
## One row per SEQN: 11933 respondents; 432 columns.

2. Screening variables and dropping sparse columns

Before engineering higher-level constructs, I first assess how complete each variable is. The next chunk computes, for every column, the number and percentage of non-missing values. This gives a quick sense of which variables are well-populated versus mostly missing.

nonmiss_tbl <- tibble::tibble(
  column = names(df),
  class  = vapply(df, function(x) paste(class(x), collapse = ","), character(1)),
  non_na = vapply(df, function(x) sum(!is.na(x)), integer(1)),
  total  = nrow(df)
) %>%
  mutate(pct_non_na = round(100 * non_na / total, 1)) %>%
  arrange(desc(pct_non_na), column)

print(nonmiss_tbl, n = Inf)
## # A tibble: 432 × 5
##     column       class     non_na total pct_non_na
##     <chr>        <chr>      <int> <int>      <dbl>
##   1 DMDHHSIZ     numeric    11933 11933      100  
##   2 HUQ010       numeric    11933 11933      100  
##   3 HUQ030       numeric    11933 11933      100  
##   4 HUQ055       numeric    11933 11933      100  
##   5 RIAGENDR     numeric    11933 11933      100  
##   6 RIDAGEYR     numeric    11933 11933      100  
##   7 RIDRETH1     numeric    11933 11933      100  
##   8 RIDRETH3     numeric    11933 11933      100  
##   9 RIDSTATR     numeric    11933 11933      100  
##  10 SDDSRVYR     numeric    11933 11933      100  
##  11 SDMVPSU      numeric    11933 11933      100  
##  12 SDMVSTRA     numeric    11933 11933      100  
##  13 SEQN         numeric    11933 11933      100  
##  14 WTINT2YR     numeric    11933 11933      100  
##  15 WTMEC2YR     numeric    11933 11933      100  
##  16 any_rx       integer    11933 11933      100  
##  17 n_rx         integer    11933 11933      100  
##  18 DMDBORN4     numeric    11914 11933       99.8
##  19 AGQ030       numeric    11743 11933       98.4
##  20 DIQ010       numeric    11740 11933       98.4
##  21 MCQ010       numeric    11743 11933       98.4
##  22 MCQ053       numeric    11741 11933       98.4
##  23 HUQ090       numeric    11169 11933       93.6
##  24 HUQ042       numeric    10702 11933       89.7
##  25 INDFMMPC     numeric    10464 11933       87.7
##  26 INQ300       numeric    10468 11933       87.7
##  27 INDFMPIR     numeric     9892 11933       82.9
##  28 SMAQUEX2     numeric     9015 11933       75.5
##  29 INDFMMPI     numeric     8989 11933       75.3
##  30 BMDSTATS     numeric     8860 11933       74.2
##  31 DR1DRSTZ     numeric     8860 11933       74.2
##  32 DR2DRSTZ     numeric     8860 11933       74.2
##  33 RIDEXMON     numeric     8860 11933       74.2
##  34 WTDRD1.x     numeric     8860 11933       74.2
##  35 WTDRD1.y     numeric     8860 11933       74.2
##  36 BMXWT        numeric     8754 11933       73.4
##  37 BMXARMC      numeric     8562 11933       71.8
##  38 BMXARML      numeric     8568 11933       71.8
##  39 BMXHT        numeric     8499 11933       71.2
##  40 BPQ020       numeric     8498 11933       71.2
##  41 BPQ080       numeric     8498 11933       71.2
##  42 BPQ101D      numeric     8498 11933       71.2
##  43 SLQ300       character   8501 11933       71.2
##  44 SLQ310       character   8501 11933       71.2
##  45 SLQ320       character   8501 11933       71.2
##  46 SLQ330       character   8501 11933       71.2
##  47 DBQ930       numeric     8486 11933       71.1
##  48 DBQ935       numeric     8486 11933       71.1
##  49 DBQ940       numeric     8486 11933       71.1
##  50 DBQ945       numeric     8486 11933       71.1
##  51 BMXBMI       numeric     8471 11933       71  
##  52 SLD012       numeric     8388 11933       70.3
##  53 SLD013       numeric     8387 11933       70.3
##  54 DIQ180       numeric     8304 11933       69.6
##  55 DMQMILIZ     numeric     8301 11933       69.6
##  56 BMXWAIST     numeric     8190 11933       68.6
##  57 PAD790U      character   8153 11933       68.3
##  58 PAD810U      character   8153 11933       68.3
##  59 PAD680       numeric     8138 11933       68.2
##  60 PAD790Q      numeric     8135 11933       68.2
##  61 PAD810Q      numeric     8139 11933       68.2
##  62 SMQ020       numeric     8135 11933       68.2
##  63 WTPH2YR.y    numeric     8068 11933       67.6
##  64 DIQ160       numeric     8022 11933       67.2
##  65 MCQ160A      numeric     7807 11933       65.4
##  66 MCQ160B      numeric     7808 11933       65.4
##  67 MCQ160C      numeric     7807 11933       65.4
##  68 MCQ160D      numeric     7808 11933       65.4
##  69 MCQ160E      numeric     7807 11933       65.4
##  70 MCQ160F      numeric     7806 11933       65.4
##  71 MCQ160L      numeric     7807 11933       65.4
##  72 MCQ160M      numeric     7806 11933       65.4
##  73 MCQ160P      numeric     7807 11933       65.4
##  74 MCQ220       numeric     7807 11933       65.4
##  75 MCQ550       numeric     7807 11933       65.4
##  76 MCQ560       numeric     7807 11933       65.4
##  77 DMDEDUC2     numeric     7794 11933       65.3
##  78 DMDMARTZ     numeric     7792 11933       65.3
##  79 DBP_MEAN     numeric     7518 11933       63  
##  80 SBP_MEAN     numeric     7518 11933       63  
##  81 BMXLEG       numeric     7335 11933       61.5
##  82 WTPH2YR.x    numeric     7199 11933       60.3
##  83 LBDTCSI      numeric     6890 11933       57.7
##  84 LBXTC        numeric     6890 11933       57.7
##  85 DBQ095Z      numeric     6797 11933       57  
##  86 DR1DAY       numeric     6797 11933       57  
##  87 DR1EXMER     numeric     6797 11933       57  
##  88 DR1LANG      numeric     6802 11933       57  
##  89 DR1STY       numeric     6797 11933       57  
##  90 DR1TWSZ      numeric     6797 11933       57  
##  91 DR1_300      numeric     6797 11933       57  
##  92 DRQSDIET     numeric     6797 11933       57  
##  93 DRQSPREP     numeric     6797 11933       57  
##  94 BMXHIP       numeric     6776 11933       56.8
##  95 DR1BWATZ     numeric     6754 11933       56.6
##  96 DR1TNUMF     numeric     6754 11933       56.6
##  97 DR1_320Z     numeric     6754 11933       56.6
##  98 DR1_330Z     numeric     6754 11933       56.6
##  99 DRDINT.x     numeric     6754 11933       56.6
## 100 DRDINT.y     numeric     6754 11933       56.6
## 101 WTDR2D.x     numeric     6754 11933       56.6
## 102 WTDR2D.y     numeric     6754 11933       56.6
## 103 n_foods_day1 integer     6751 11933       56.6
## 104 DR1HELP      numeric     6747 11933       56.5
## 105 DR1MRESP     numeric     6747 11933       56.5
## 106 DRABF.x      numeric     6734 11933       56.4
## 107 DRABF.y      numeric     6734 11933       56.4
## 108 LBXGH        numeric     6715 11933       56.3
## 109 DR1TACAR     numeric     6694 11933       56.1
## 110 DR1TALCO     numeric     6694 11933       56.1
## 111 DR1TATOA     numeric     6694 11933       56.1
## 112 DR1TATOC     numeric     6694 11933       56.1
## 113 DR1TB12A     numeric     6694 11933       56.1
## 114 DR1TBCAR     numeric     6694 11933       56.1
## 115 DR1TCAFF     numeric     6694 11933       56.1
## 116 DR1TCALC     numeric     6694 11933       56.1
## 117 DR1TCARB     numeric     6694 11933       56.1
## 118 DR1TCHL      numeric     6694 11933       56.1
## 119 DR1TCHOL     numeric     6694 11933       56.1
## 120 DR1TCOPP     numeric     6694 11933       56.1
## 121 DR1TCRYP     numeric     6694 11933       56.1
## 122 DR1TFA       numeric     6694 11933       56.1
## 123 DR1TFDFE     numeric     6694 11933       56.1
## 124 DR1TFF       numeric     6694 11933       56.1
## 125 DR1TFIBE     numeric     6694 11933       56.1
## 126 DR1TFOLA     numeric     6694 11933       56.1
## 127 DR1TIRON     numeric     6694 11933       56.1
## 128 DR1TKCAL     numeric     6694 11933       56.1
## 129 DR1TLYCO     numeric     6694 11933       56.1
## 130 DR1TLZ       numeric     6694 11933       56.1
## 131 DR1TM161     numeric     6694 11933       56.1
## 132 DR1TM181     numeric     6694 11933       56.1
## 133 DR1TM201     numeric     6694 11933       56.1
## 134 DR1TM221     numeric     6694 11933       56.1
## 135 DR1TMAGN     numeric     6694 11933       56.1
## 136 DR1TMFAT     numeric     6694 11933       56.1
## 137 DR1TMOIS     numeric     6694 11933       56.1
## 138 DR1TNIAC     numeric     6694 11933       56.1
## 139 DR1TP182     numeric     6694 11933       56.1
## 140 DR1TP183     numeric     6694 11933       56.1
## 141 DR1TP184     numeric     6694 11933       56.1
## 142 DR1TP204     numeric     6694 11933       56.1
## 143 DR1TP205     numeric     6694 11933       56.1
## 144 DR1TP225     numeric     6694 11933       56.1
## 145 DR1TP226     numeric     6694 11933       56.1
## 146 DR1TPFAT     numeric     6694 11933       56.1
## 147 DR1TPHOS     numeric     6694 11933       56.1
## 148 DR1TPOTA     numeric     6694 11933       56.1
## 149 DR1TPROT     numeric     6694 11933       56.1
## 150 DR1TRET      numeric     6694 11933       56.1
## 151 DR1TS040     numeric     6694 11933       56.1
## 152 DR1TS060     numeric     6694 11933       56.1
## 153 DR1TS080     numeric     6694 11933       56.1
## 154 DR1TS100     numeric     6694 11933       56.1
## 155 DR1TS120     numeric     6694 11933       56.1
## 156 DR1TS140     numeric     6694 11933       56.1
## 157 DR1TS160     numeric     6694 11933       56.1
## 158 DR1TS180     numeric     6694 11933       56.1
## 159 DR1TSELE     numeric     6694 11933       56.1
## 160 DR1TSFAT     numeric     6694 11933       56.1
## 161 DR1TSODI     numeric     6694 11933       56.1
## 162 DR1TSUGR     numeric     6694 11933       56.1
## 163 DR1TTFAT     numeric     6694 11933       56.1
## 164 DR1TTHEO     numeric     6694 11933       56.1
## 165 DR1TVARA     numeric     6694 11933       56.1
## 166 DR1TVB1      numeric     6694 11933       56.1
## 167 DR1TVB12     numeric     6694 11933       56.1
## 168 DR1TVB2      numeric     6694 11933       56.1
## 169 DR1TVB6      numeric     6694 11933       56.1
## 170 DR1TVC       numeric     6694 11933       56.1
## 171 DR1TVD       numeric     6694 11933       56.1
## 172 DR1TVK       numeric     6694 11933       56.1
## 173 DR1TZINC     numeric     6694 11933       56.1
## 174 DRD340       numeric     6694 11933       56.1
## 175 DRD360       numeric     6695 11933       56.1
## 176 PAD800       numeric     6390 11933       53.5
## 177 DR1DBIH      numeric     6375 11933       53.4
## 178 DR2LANG      numeric     5929 11933       49.7
## 179 DR2DAY       numeric     5902 11933       49.5
## 180 DR2EXMER     numeric     5902 11933       49.5
## 181 DR2STY       numeric     5902 11933       49.5
## 182 DR2TWSZ      numeric     5902 11933       49.5
## 183 DR2_300      numeric     5902 11933       49.5
## 184 DR2BWATZ     numeric     5879 11933       49.3
## 185 DR2MRESP     numeric     5879 11933       49.3
## 186 DR2TNUMF     numeric     5879 11933       49.3
## 187 DR2_320Z     numeric     5879 11933       49.3
## 188 DR2_330Z     numeric     5879 11933       49.3
## 189 n_foods_day2 integer     5879 11933       49.3
## 190 DR2HELP      numeric     5876 11933       49.2
## 191 DR2TACAR     numeric     5830 11933       48.9
## 192 DR2TALCO     numeric     5830 11933       48.9
## 193 DR2TATOA     numeric     5830 11933       48.9
## 194 DR2TATOC     numeric     5830 11933       48.9
## 195 DR2TB12A     numeric     5830 11933       48.9
## 196 DR2TBCAR     numeric     5830 11933       48.9
## 197 DR2TCAFF     numeric     5830 11933       48.9
## 198 DR2TCALC     numeric     5830 11933       48.9
## 199 DR2TCARB     numeric     5830 11933       48.9
## 200 DR2TCHL      numeric     5830 11933       48.9
## 201 DR2TCHOL     numeric     5830 11933       48.9
## 202 DR2TCOPP     numeric     5830 11933       48.9
## 203 DR2TCRYP     numeric     5830 11933       48.9
## 204 DR2TFA       numeric     5830 11933       48.9
## 205 DR2TFDFE     numeric     5830 11933       48.9
## 206 DR2TFF       numeric     5830 11933       48.9
## 207 DR2TFIBE     numeric     5830 11933       48.9
## 208 DR2TFOLA     numeric     5830 11933       48.9
## 209 DR2TIRON     numeric     5830 11933       48.9
## 210 DR2TKCAL     numeric     5830 11933       48.9
## 211 DR2TLYCO     numeric     5830 11933       48.9
## 212 DR2TLZ       numeric     5830 11933       48.9
## 213 DR2TM161     numeric     5830 11933       48.9
## 214 DR2TM181     numeric     5830 11933       48.9
## 215 DR2TM201     numeric     5830 11933       48.9
## 216 DR2TM221     numeric     5830 11933       48.9
## 217 DR2TMAGN     numeric     5830 11933       48.9
## 218 DR2TMFAT     numeric     5830 11933       48.9
## 219 DR2TMOIS     numeric     5830 11933       48.9
## 220 DR2TNIAC     numeric     5830 11933       48.9
## 221 DR2TP182     numeric     5830 11933       48.9
## 222 DR2TP183     numeric     5830 11933       48.9
## 223 DR2TP184     numeric     5830 11933       48.9
## 224 DR2TP204     numeric     5830 11933       48.9
## 225 DR2TP205     numeric     5830 11933       48.9
## 226 DR2TP225     numeric     5830 11933       48.9
## 227 DR2TP226     numeric     5830 11933       48.9
## 228 DR2TPFAT     numeric     5830 11933       48.9
## 229 DR2TPHOS     numeric     5830 11933       48.9
## 230 DR2TPOTA     numeric     5830 11933       48.9
## 231 DR2TPROT     numeric     5830 11933       48.9
## 232 DR2TRET      numeric     5830 11933       48.9
## 233 DR2TS040     numeric     5830 11933       48.9
## 234 DR2TS060     numeric     5830 11933       48.9
## 235 DR2TS080     numeric     5830 11933       48.9
## 236 DR2TS100     numeric     5830 11933       48.9
## 237 DR2TS120     numeric     5830 11933       48.9
## 238 DR2TS140     numeric     5830 11933       48.9
## 239 DR2TS160     numeric     5830 11933       48.9
## 240 DR2TS180     numeric     5830 11933       48.9
## 241 DR2TSELE     numeric     5830 11933       48.9
## 242 DR2TSFAT     numeric     5830 11933       48.9
## 243 DR2TSODI     numeric     5830 11933       48.9
## 244 DR2TSUGR     numeric     5830 11933       48.9
## 245 DR2TTFAT     numeric     5830 11933       48.9
## 246 DR2TTHEO     numeric     5830 11933       48.9
## 247 DR2TVARA     numeric     5830 11933       48.9
## 248 DR2TVB1      numeric     5830 11933       48.9
## 249 DR2TVB12     numeric     5830 11933       48.9
## 250 DR2TVB2      numeric     5830 11933       48.9
## 251 DR2TVB6      numeric     5830 11933       48.9
## 252 DR2TVC       numeric     5830 11933       48.9
## 253 DR2TVD       numeric     5830 11933       48.9
## 254 DR2TVK       numeric     5830 11933       48.9
## 255 DR2TZINC     numeric     5830 11933       48.9
## 256 HSQ590       numeric     5751 11933       48.2
## 257 OSQ230       numeric     5723 11933       48  
## 258 IND310       numeric     5650 11933       47.3
## 259 DR2DBIH      numeric     5550 11933       46.5
## 260 DPQ010       numeric     5519 11933       46.2
## 261 DPQ020       numeric     5518 11933       46.2
## 262 DPQ030       numeric     5516 11933       46.2
## 263 DPQ040       numeric     5514 11933       46.2
## 264 DPQ050       numeric     5513 11933       46.2
## 265 DPQ060       numeric     5510 11933       46.2
## 266 DPQ070       numeric     5508 11933       46.2
## 267 DPQ080       numeric     5508 11933       46.2
## 268 DPQ090       numeric     5506 11933       46.1
## 269 ALQ111       numeric     5481 11933       45.9
## 270 ALQ121       numeric     4922 11933       41.2
## 271 ALQ151       numeric     4901 11933       41.1
## 272 DBD100       numeric     4742 11933       39.7
## 273 DRD370A      numeric     4175 11933       35  
## 274 DRD370B      numeric     4175 11933       35  
## 275 DRD370C      numeric     4175 11933       35  
## 276 DRD370D      numeric     4175 11933       35  
## 277 DRD370E      numeric     4175 11933       35  
## 278 DRD370F      numeric     4175 11933       35  
## 279 DRD370G      numeric     4175 11933       35  
## 280 DRD370H      numeric     4175 11933       35  
## 281 DRD370I      numeric     4175 11933       35  
## 282 DRD370J      numeric     4175 11933       35  
## 283 DRD370K      numeric     4175 11933       35  
## 284 DRD370L      numeric     4175 11933       35  
## 285 DRD370M      numeric     4175 11933       35  
## 286 DRD370N      numeric     4175 11933       35  
## 287 DRD370O      numeric     4175 11933       35  
## 288 DRD370P      numeric     4175 11933       35  
## 289 DRD370Q      numeric     4175 11933       35  
## 290 DRD370R      numeric     4175 11933       35  
## 291 DRD370S      numeric     4175 11933       35  
## 292 DRD370T      numeric     4175 11933       35  
## 293 DRD370U      numeric     4175 11933       35  
## 294 DRD370V      numeric     4173 11933       35  
## 295 DPQ100       numeric     4167 11933       34.9
## 296 DMDHRAGZ     numeric     4124 11933       34.6
## 297 DMDHRGND     numeric     4115 11933       34.5
## 298 ALQ142       numeric     4082 11933       34.2
## 299 ALQ130       numeric     4069 11933       34.1
## 300 DMDHRMAZ     numeric     4020 11933       33.7
## 301 DMDHREDZ     numeric     3746 11933       31.4
## 302 PAD820       numeric     3687 11933       30.9
## 303 DBQ301       numeric     3500 11933       29.3
## 304 DBQ330       numeric     3500 11933       29.3
## 305 DBQ360       numeric     3357 11933       28.1
## 306 SMQ040       numeric     3243 11933       27.2
## 307 DRD350A      numeric     3033 11933       25.4
## 308 DRD350B      numeric     3033 11933       25.4
## 309 DRD350C      numeric     3033 11933       25.4
## 310 DRD350D      numeric     3033 11933       25.4
## 311 DRD350E      numeric     3033 11933       25.4
## 312 DRD350F      numeric     3033 11933       25.4
## 313 DRD350G      numeric     3033 11933       25.4
## 314 DRD350H      numeric     3033 11933       25.4
## 315 DRD350I      numeric     3032 11933       25.4
## 316 DRD350J      numeric     3032 11933       25.4
## 317 DRD350K      numeric     3031 11933       25.4
## 318 BPQ030       numeric     2968 11933       24.9
## 319 BPQ150       numeric     2969 11933       24.9
## 320 RIDEXAGM     numeric     2787 11933       23.4
## 321 DBQ370       numeric     2762 11933       23.1
## 322 DBQ400       numeric     2762 11933       23.1
## 323 DRD350HQ     numeric     2660 11933       22.3
## 324 DBD381       numeric     2643 11933       22.1
## 325 MCQ195       numeric     2532 11933       21.2
## 326 BMDBMIC      numeric     2492 11933       20.9
## 327 DBD411       numeric     2378 11933       19.9
## 328 ALQ170       numeric     2358 11933       19.8
## 329 ALQ270       numeric     2366 11933       19.8
## 330 ALQ280       numeric     2362 11933       19.8
## 331 DIQ070       numeric     2281 11933       19.1
## 332 DBQ390       numeric     2159 11933       18.1
## 333 DMDHSEDZ     numeric     2127 11933       17.8
## 334 DRD370MQ     numeric     2037 11933       17.1
## 335 MCQ035       numeric     1946 11933       16.3
## 336 DMDYRUSR     numeric     1875 11933       15.7
## 337 DRD370BQ     numeric     1864 11933       15.6
## 338 DBQ424       numeric     1780 11933       14.9
## 339 MCQ500       numeric     1578 11933       13.2
## 340 RIDEXPRG     numeric     1503 11933       12.6
## 341 DBD041       numeric     1441 11933       12.1
## 342 DBD055       numeric     1440 11933       12.1
## 343 DBQ010       numeric     1441 11933       12.1
## 344 DBQ421       numeric     1400 11933       11.7
## 345 DBD061       numeric     1356 11933       11.4
## 346 MCQ040       numeric     1219 11933       10.2
## 347 MCQ050       numeric     1219 11933       10.2
## 348 DR1SKY       numeric     1197 11933       10  
## 349 SMD650       numeric     1185 11933        9.9
## 350 MCQ230A      numeric     1169 11933        9.8
## 351 SMD100MN     numeric     1175 11933        9.8
## 352 DBD050       numeric     1153 11933        9.7
## 353 DBD030       numeric     1121 11933        9.4
## 354 DID040       numeric     1081 11933        9.1
## 355 DIQ050       numeric     1081 11933        9.1
## 356 DR2SKY       numeric     1074 11933        9  
## 357 MCQ170M      numeric     1053 11933        8.8
## 358 DRD370TQ     numeric      982 11933        8.2
## 359 DBQ073A      numeric      869 11933        7.3
## 360 SMQ621       numeric      772 11933        6.5
## 361 DRD370EQ     numeric      701 11933        5.9
## 362 DRD370AQ     numeric      628 11933        5.3
## 363 DRD350BQ     numeric      548 11933        4.6
## 364 BMXRECUM     numeric      454 11933        3.8
## 365 DRQSDT1      numeric      447 11933        3.7
## 366 MCQ149       numeric      434 11933        3.6
## 367 MCQ170L      numeric      425 11933        3.6
## 368 BMILEG       numeric      396 11933        3.3
## 369 RIDAGEMN     numeric      377 11933        3.2
## 370 BMIHIP       numeric      361 11933        3  
## 371 DRD350GQ     numeric      357 11933        3  
## 372 BMIWAIST     numeric      347 11933        2.9
## 373 BMIWT        numeric      345 11933        2.9
## 374 DID060       numeric      343 11933        2.9
## 375 DRD370DQ     numeric      347 11933        2.9
## 376 DIQ060U      numeric      332 11933        2.8
## 377 DRD350DQ     numeric      313 11933        2.6
## 378 SMD641       numeric      273 11933        2.3
## 379 MCQ510A      numeric      260 11933        2.2
## 380 DRD350AQ     numeric      256 11933        2.1
## 381 DRD350FQ     numeric      230 11933        1.9
## 382 DRD350IQ     numeric      223 11933        1.9
## 383 DRD370GQ     numeric      223 11933        1.9
## 384 DRD350EQ     numeric      217 11933        1.8
## 385 DRD370NQ     numeric      213 11933        1.8
## 386 BMIARMC      numeric      205 11933        1.7
## 387 BMIARML      numeric      200 11933        1.7
## 388 DBQ073B      numeric      180 11933        1.5
## 389 DRD370KQ     numeric      172 11933        1.4
## 390 DRD370RQ     numeric      153 11933        1.3
## 391 MCQ230B      numeric      159 11933        1.3
## 392 DRQSDT91     numeric      148 11933        1.2
## 393 BMIHT        numeric      134 11933        1.1
## 394 DRD370FQ     numeric      134 11933        1.1
## 395 DRD370UQ     numeric      124 11933        1  
## 396 DRD370SQ     numeric       96 11933        0.8
## 397 DRQSDT3      numeric       96 11933        0.8
## 398 DRQSDT7      numeric       93 11933        0.8
## 399 DBQ073U      numeric       82 11933        0.7
## 400 DRD370HQ     numeric       85 11933        0.7
## 401 DRD370OQ     numeric       83 11933        0.7
## 402 MCQ510F      numeric       79 11933        0.7
## 403 BMXHEAD      numeric       70 11933        0.6
## 404 DBQ073C      numeric       66 11933        0.6
## 405 DRD370IQ     numeric       76 11933        0.6
## 406 DRQSDT2      numeric       70 11933        0.6
## 407 DRQSDT9      numeric       70 11933        0.6
## 408 DRD350CQ     numeric       62 11933        0.5
## 409 DRD370CQ     numeric       60 11933        0.5
## 410 MCQ510D      numeric       62 11933        0.5
## 411 DRD370QQ     numeric       45 11933        0.4
## 412 DRQSDT4      numeric       44 11933        0.4
## 413 MCQ510C      numeric       47 11933        0.4
## 414 DRQSDT10     numeric       33 11933        0.3
## 415 DRQSDT11     numeric       32 11933        0.3
## 416 BMIRECUM     numeric       18 11933        0.2
## 417 DBQ073E      numeric       21 11933        0.2
## 418 DRD370LQ     numeric       18 11933        0.2
## 419 DRQSDT8      numeric       27 11933        0.2
## 420 MCQ230C      numeric       26 11933        0.2
## 421 SMD630       numeric       23 11933        0.2
## 422 DBQ073D      numeric       16 11933        0.1
## 423 DRD350JQ     numeric       15 11933        0.1
## 424 DRD370JQ     numeric       12 11933        0.1
## 425 DRQSDT12     numeric       13 11933        0.1
## 426 DRQSDT6      numeric        7 11933        0.1
## 427 MCQ510E      numeric       16 11933        0.1
## 428 BMIHEAD      numeric        0 11933        0  
## 429 DRD370PQ     numeric        4 11933        0  
## 430 DRQSDT5      numeric        2 11933        0  
## 431 MCQ230D      numeric        2 11933        0  
## 432 MCQ510B      numeric        4 11933        0

For the analysis, I restrict attention to variables with at least 100 non-missing observations, which balances keeping useful signal and avoiding extremely sparse fields. I also save this cleaned person-level dataset to disk for reproducibility.

keep_by_nonmissing <- function(data, n = 100) {
  counts <- colSums(!is.na(data))
  keep   <- counts >= n

  kept_cols    <- names(data)[keep]
  dropped_cols <- names(data)[!keep]

  message(sprintf("Kept %d cols, dropped %d cols (threshold = %d non-NA).",
                  length(kept_cols), length(dropped_cols), n))

  list(
    df      = data[, kept_cols, drop = FALSE],
    counts  = counts,
    kept    = kept_cols,
    dropped = dropped_cols
  )
}

res <- keep_by_nonmissing(df, n = 100)
## Kept 395 cols, dropped 37 cols (threshold = 100 non-NA).
df  <- res$df 

out_path <- "Data 237 Final_Datasets/df_clean.csv"

# readr::write_csv is fast and UTF-8 safe
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
readr::write_csv(df, out_path)

3. Building a codebook and data dictionary

NHANES XPT files carry both variable labels (descriptions) and, for labelled variables, value labels (e.g., 1 = Yes, 2 = No). To keep track of this metadata, I construct a codebook across all component files and then a data dictionary for the final merged df_clean.csv. This lets me trace each analytic variable back to its original NHANES source and interpret categorical codes correctly.

# Function: extract labels (variable & value labels) from one XPT
xpt_to_codebook <- function(path, source_name = basename(path)) {
  if (!file.exists(path)) return(tibble())
  dx <- read_xpt(path)
  vars <- names(dx)
  # variable labels
  var_lab <- map_chr(dx, ~ attr(.x, "label") %||% "")
  # class info
  classes <- map_chr(dx, ~ paste(class(.x), collapse = "|"))
  # value labels (if labelled)
  val_labs <- map(dx, function(x) {
    labs <- attr(x, "labels", exact = TRUE)
    if (is.null(labs)) return(NA_character_)
    # turn named integer vector into "code=label; code=label; ..."
    paste(paste0(unname(labs), "=", names(labs)), collapse = "; ")
  }) |> unlist(use.names = FALSE)

  tibble(
    source = source_name,
    varname = vars,
    var_label = var_lab,
    class = classes,
    value_labels = val_labs
  )
}

# Build merged codebook from all XPTs
codebook <- imap_dfr(paths, ~ xpt_to_codebook(.x, source_name = .y))

codebook_path <- "Data 237 Final_Datasets/NHANES_2123_codebook_from_XPT.csv"
#write_csv(codebook, codebook_path) #print if viewer would like to see detailed descriptions for each dataset column, a csv version is included in the submission folder.

# Augment a CSV dictionary for df_clean.csv
df_csv_path <- "Data 237 Final_Datasets/df_clean.csv"
if (file.exists(df_csv_path)) {
  df_csv <- read_csv(df_csv_path, show_col_types = FALSE)
  dict <- tibble(
    varname = names(df_csv),
    class_in_csv = vapply(df_csv, function(x) paste(class(x), collapse = "|"), character(1)),
    non_na = vapply(df_csv, function(x) sum(!is.na(x)), integer(1)),
    n_unique = vapply(df_csv, function(x) dplyr::n_distinct(x, na.rm = TRUE), integer(1))
  ) |>
    left_join(codebook |> select(source, varname, var_label, value_labels),
              by = "varname") |>
    arrange(varname)

  dict_path <- "Data 237 Final_Datasets/df_clean_DATA_DICTIONARY.csv"
  write_csv(dict, dict_path)
  message("Wrote df dictionary: ", normalizePath(dict_path))
}
## Wrote df dictionary: /Users/leon03/Downloads/Leon Zhou_Data237 Final Project_Final/Data 237 Final_Datasets/df_clean_DATA_DICTIONARY.csv

4. Constructing analytic variables and behavior load index

The main goal is to relate diet quality and behavior load to cardiometabolic outcomes. To do this I:

These steps create a compact analytic dataset df1 with socio-demographics, behaviors, and clinical markers that I will use for all downstream plots.

# Helper: given a regex pattern, find the first matching column name
# in the NHANES dictionary that is actually present in df.
find_col <- function(pattern) {
  hits <- dict$varname[str_detect(dict$varname, regex(pattern, ignore_case = TRUE))]
  hits <- hits[hits %in% names(df)]
  if (length(hits)) hits[1] else NA_character_
}

# Map each conceptual construct to its underlying NHANES variable(s).
# The regexes allow some robustness across different naming versions.
vars <- list(
  age     = find_col("^RIDAGEYR$|RIDAGE"),           # age in years
  sex     = find_col("^RIAGENDR$|RIAGEN"),           # sex (1=Male, 2=Female)
  race    = find_col("^RIDRETH3$|RIDRETH"),          # race/ethnicity
  pir     = find_col("^INDFMPIR$|PIR"),              # income-to-poverty ratio
  access  = find_col("^HUQ090$|HUQ090"),             # seen/talked to provider past 12m (1=yes)
  btest   = find_col("^DIQ180$|DIQ180"),             # diabetes blood test past 12m (1=yes)
  dx_htn  = find_col("^BPQ020$|BPQ020"),             # ever told HTN (1=yes)
  dx_dm   = find_col("^DIQ010$|DIQ010"),             # ever told DM (1=yes)
  bmi     = find_col("^BMXBMI$|BMI"),                # body mass index
  sbp     = find_col("^SBP_MEAN$|BPXOSY|SBP"),       # SBP (prefer aggregated SBP_MEAN if present)
  dbp     = find_col("^DBP_MEAN$|BPXODI|DBP"),       # DBP
  a1c     = find_col("^LBXGH$|^HBA1C$|^GHB"),        # glycohemoglobin A1c
  sleep   = find_col("^SLD010H$|^SLQ0|SLEEP"),       # usual hours sleep
  smoke   = find_col("^SMQ020$|^SMQ0|SMOK"),         # current smoking indicator
  alc_days= find_col("^ALQ120Q$|^ALQ101$|^ALQ130$|ALC"),    # alcohol frequency proxy
  pa_mod  = find_col("^PAQ6|^PAQ65|^PAQ706|^PAQ(.*)MOD"),   # moderate PA minutes or frequency
  dpq_sum = find_col("^DPQ(.*)TOT$|^DPQ\\d+$")              # PHQ-9 total or individual DPQ items
)
print(vars)
## $age
## [1] "RIDAGEMN"
## 
## $sex
## [1] "RIAGENDR"
## 
## $race
## [1] "RIDRETH1"
## 
## $pir
## [1] "INDFMPIR"
## 
## $access
## [1] "HUQ090"
## 
## $btest
## [1] "DIQ180"
## 
## $dx_htn
## [1] "BPQ020"
## 
## $dx_dm
## [1] "DIQ010"
## 
## $bmi
## [1] "BMDBMIC"
## 
## $sbp
## [1] "SBP_MEAN"
## 
## $dbp
## [1] "DBP_MEAN"
## 
## $a1c
## [1] "LBXGH"
## 
## $sleep
## [1] NA
## 
## $smoke
## [1] "SMQ020"
## 
## $alc_days
## [1] "ALQ130"
## 
## $pa_mod
## [1] NA
## 
## $dpq_sum
## [1] "DPQ010"
# Generic accessors to standardize how we pull and transform columns
col_or_na <- function(data, nm) {
  # Return the column if it exists, otherwise a vector of NAs
  if (!is.na(nm) && length(nm) == 1 && nm %in% names(data)) data[[nm]] else NA
}
yn_vec <- function(x) as.integer(!is.na(x) & x == 1)           # convert NHANES 1=yes to 0/1
num_vec <- function(x) suppressWarnings(as.numeric(x))         # numeric coercion with warning suppression

# Build df1: a person-level analytic frame with typed variables
df1 <- df %>%
  mutate(
    # Demographics and income
    age     = col_or_na(df, vars$age),
    sex_raw = col_or_na(df, vars$sex),
    sex     = factor(ifelse(is.na(sex_raw), NA, sex_raw),
                     levels = c(1, 2), labels = c("Male","Female")),
    pir     = col_or_na(df, vars$pir),
    pir_grp = cut(
      pir,
      breaks = quantile(pir, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE),
      include.lowest = TRUE,
      labels = c("Low PIR","Mid PIR","High PIR")
    ),

    # Access and diagnosis indicators
    access  = yn_vec(col_or_na(df, vars$access)),
    btest   = yn_vec(col_or_na(df, vars$btest)),
    dx_htn  = yn_vec(col_or_na(df, vars$dx_htn)),
    dx_dm   = yn_vec(col_or_na(df, vars$dx_dm)),

    # Clinical measurements
    bmi     = num_vec(col_or_na(df, vars$bmi)),
    sbp     = num_vec(col_or_na(df, vars$sbp)),
    dbp     = num_vec(col_or_na(df, vars$dbp)),
    a1c     = num_vec(col_or_na(df, vars$a1c)),

    # Behavioral measures
    sleep_h = num_vec(col_or_na(df, vars$sleep)),      # usual hours of sleep
    smk_now = yn_vec(col_or_na(df, vars$smoke)),       # current smoker (0/1)
    alc_days= num_vec(col_or_na(df, vars$alc_days)),   # drinking frequency proxy
    pa_mod  = num_vec(col_or_na(df, vars$pa_mod))      # moderate physical activity
  )

# Quick sanity check on basic distributions
summary(select(
  df1, age, sex, pir, pir_grp, access, btest, dx_htn, dx_dm,
  bmi, sbp, dbp, a1c, sleep_h, smk_now, alc_days, pa_mod
))
##       age            sex            pir            pir_grp         access      
##  Min.   : 0.00   Male  :5575   Min.   :0.000   Low PIR :3311   Min.   :0.0000  
##  1st Qu.: 6.00   Female:6358   1st Qu.:1.180   Mid PIR :3305   1st Qu.:0.0000  
##  Median :11.00                 Median :2.500   High PIR:3276   Median :0.0000  
##  Mean   :11.63                 Mean   :2.708   NA's    :2041   Mean   :0.1378  
##  3rd Qu.:17.00                 3rd Qu.:4.500                   3rd Qu.:0.0000  
##  Max.   :24.00                 Max.   :5.000                   Max.   :1.0000  
##  NA's   :11556                 NA's   :2041                                    
##      btest            dx_htn           dx_dm              bmi       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:2.000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :0.2824   Mean   :0.2488   Mean   :0.09059   Mean   :2.561  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :4.000  
##                                                      NA's   :9441   
##       sbp             dbp              a1c           sleep_h     
##  Min.   : 70.0   Min.   : 34.00   Min.   : 3.20   Min.   : NA    
##  1st Qu.:106.3   1st Qu.: 64.00   1st Qu.: 5.20   1st Qu.: NA    
##  Median :116.3   Median : 71.67   Median : 5.50   Median : NA    
##  Mean   :119.1   Mean   : 72.21   Mean   : 5.71   Mean   :NaN    
##  3rd Qu.:129.0   3rd Qu.: 79.33   3rd Qu.: 5.80   3rd Qu.: NA    
##  Max.   :232.3   Max.   :139.00   Max.   :17.10   Max.   : NA    
##  NA's   :4415    NA's   :4415     NA's   :5218    NA's   :11933  
##     smk_now          alc_days           pa_mod     
##  Min.   :0.0000   Min.   :  1.000   Min.   : NA    
##  1st Qu.:0.0000   1st Qu.:  1.000   1st Qu.: NA    
##  Median :0.0000   Median :  2.000   Median : NA    
##  Mean   :0.2718   Mean   :  5.843   Mean   :NaN    
##  3rd Qu.:1.0000   3rd Qu.:  3.000   3rd Qu.: NA    
##  Max.   :1.0000   Max.   :999.000   Max.   : NA    
##                   NA's   :7864      NA's   :11933
# Helper: row-wise mean for a set of numeric columns (e.g., diet day 1 + day 2)
row_mean_numeric <- function(data, cols) {
  if (length(cols) == 0) return(rep(NA_real_, nrow(data)))
  M <- as.data.frame(lapply(
    cols,
    function(cn) suppressWarnings(as.numeric(data[[cn]]))
  ))
  rowMeans(M, na.rm = TRUE)
}

# Helper: rescale numeric vector to [0, 1] range, guarding against degenerate ranges
rescale01 <- function(x) {
  r <- range(x, na.rm = TRUE)
  if (!all(is.finite(r)) || diff(r) == 0) return(rep(NA_real_, length(x)))
  (x - r[1]) / diff(r)
}

# Identify diet recall variables (across day 1 and day 2 totals)
diet_kcal_vars <- intersect(
  names(df1),
  grep("^DR[12].*KCAL$", names(df1), ignore.case = TRUE, value = TRUE)
)

diet_sfat_vars <- intersect(
  names(df1),
  grep("^DR[12].*(SFAT|TSFAT)$", names(df1), ignore.case = TRUE, value = TRUE)
)

diet_fibe_vars <- intersect(
  names(df1),
  grep("^DR[12].*(FIBE|DFIB)$", names(df1), ignore.case = TRUE, value = TRUE)
)

# Print which diet variables were found, just for debugging / transparency
print(list(kcal = diet_kcal_vars, sfat = diet_sfat_vars, fibe = diet_fibe_vars))
## $kcal
## [1] "DR1TKCAL" "DR2TKCAL"
## 
## $sfat
## [1] "DR1TSFAT" "DR2TSFAT"
## 
## $fibe
## [1] "DR1TFIBE" "DR2TFIBE"
# Construct energy-normalized diet variables and composite diet score
df1 <- df1 %>%
  mutate(
    # Average energy, saturated fat, and fiber across available recall days
    diet_kcal = if (length(diet_kcal_vars)) {
      rowMeans(across(all_of(diet_kcal_vars)), na.rm = TRUE)
    } else NA_real_,
    diet_sfat = if (length(diet_sfat_vars)) {
      rowMeans(across(all_of(diet_sfat_vars)), na.rm = TRUE)
    } else NA_real_,
    diet_fibe = if (length(diet_fibe_vars)) {
      rowMeans(across(all_of(diet_fibe_vars)), na.rm = TRUE)
    } else NA_real_
  ) %>%
  mutate(
    # Densities (per kcal) to partially adjust for total energy intake
    satfat_dens = diet_sfat / pmax(diet_kcal, 1),
    fiber_dens  = diet_fibe / pmax(diet_kcal, 1),

    # Composite diet score: higher is "better" (low saturated fat, high fiber)
    diet_score  = rescale01(-satfat_dens) + rescale01(fiber_dens)
  )

# -------------------------------------------------------------------
# PHQ-9 total (depression severity) and Behavior Load Index (BLI)
# -------------------------------------------------------------------

# If a precomputed PHQ-9 total is not present, compute it by summing DPQ items.
if (!"dpq_sum" %in% names(df1) || all(is.na(df1$dpq_sum))) {
  dpq_items <- dict$varname[
    grepl("^DPQ0\\d+$", dict$varname) & dict$varname %in% names(df1)
  ]
  if (length(dpq_items) > 0) {
    df1 <- df1 %>%
      mutate(dpq_sum = rowSums(dplyr::across(dplyr::all_of(dpq_items)), na.rm = TRUE))
  } else {
    df1 <- df1 %>% mutate(dpq_sum = NA_real_)
  }
}

Below I define the metric of Behavior Load Index (BLI): count six risk factors into a 0–6 index # 1) current smoking # 2) heavy alcohol use # 3) short (<6h) or long (>9h) sleep # 4) low physical activity # 5) depression (PHQ-9 >= 10) # 6) low diet score (bottom third)

q_diet <- suppressWarnings(quantile(df1$diet_score, probs = 1/3, na.rm = TRUE))
diet_cut <- if (is.finite(q_diet)) q_diet else NA_real_

df1 <- df1 %>%
  mutate(
    sleep_flag = as.integer(!is.na(sleep_h) & (sleep_h < 6 | sleep_h > 9)),
    heavy_alc  = as.integer(!is.na(alc_days) & alc_days >= 4),
    low_pa     = as.integer(!is.na(pa_mod) & pa_mod < 150),
    dep_flag   = as.integer(!is.na(dpq_sum) & dpq_sum >= 10),
    low_diet   = as.integer(!is.na(diet_score) & !is.na(diet_cut) & diet_score <= diet_cut),

    # Raw BLI: sum of all risk flags, ignoring missingness within each person
    bli_raw    = rowSums(
      cbind(smk_now, heavy_alc, sleep_flag, low_pa, dep_flag, low_diet),
      na.rm = TRUE
    )
  ) %>%
  mutate(
    # Convert BLI into quintiles for plotting (Q1 = lowest load, Q5 = highest load)
    bli_q = if (all(is.na(bli_raw))) NA_integer_ else dplyr::ntile(bli_raw, 5),
    bli_q = factor(
      bli_q,
      labels = c("BLI Q1 (lowest)","Q2","Q3","Q4","BLI Q5 (highest)")
    )
  )

5. Tertiles, age bands, and tile-plot helpers

For the explorative heatmaps below, I want most patterns to be readable “at a glance,” so I compress several continuous variables into tertiles or bands:

After creating these banded variables, I define two generic helpers:

  1. tile_df()
    Takes any dataset plus:

    • an x-axis banding variable (e.g., diet tertiles),
    • a y-axis banding variable (e.g., BLI tertiles, PIR tertiles, or age band),
    • and a value variable (binary or continuous),

    and returns a summarized table with the cell count n, either percent “Yes” for binary outcomes or a mean/median for continuous outcomes, plus a human-friendly label string for plotting.

  2. tile_plot()
    Takes the tile_df() output and produces a standardized heatmap:

    • fill encodes the summary value,
    • white tile borders and bold labels improve readability,
    • axes and legend titles are kept consistent across all panels.

These helpers keep the later figure code short and readable: once the bands and helpers are defined, each new heatmap is just a call to tile_df() followed by tile_plot().

df1 <- df1 %>%
  mutate(
    # Diet quality tertiles (Low / Mid / High)
    diet_terc = case_when(
      is.na(diet_score) ~ NA_character_,
      TRUE ~ paste(c("Low","Mid","High")[ntile(diet_score, 3)])
    ) |> factor(c("Low","Mid","High"), ordered = TRUE),

    # BLI tertiles from raw 0–6 index
    bli_terc = if (!all(is.na(bli_raw))) {
      paste(c("Low","Mid","High")[ntile(bli_raw, 3)])
    } else NA_character_
  ) %>%
  mutate(
    bli_terc = factor(bli_terc, levels = c("Low","Mid","High"), ordered = TRUE),

    # PIR tertiles (income)
    pir_terc = case_when(
      !is.na(pir) ~ paste(c("Low","Mid","High")[ntile(pir, 3)]),
      !is.na(pir_grp) ~ case_when(
        pir_grp %in% c("Low PIR","Low") ~ "Low",
        pir_grp %in% c("Mid PIR","Mid") ~ "Mid",
        pir_grp %in% c("High PIR","High") ~ "High",
        TRUE ~ NA_character_
      ),
      TRUE ~ NA_character_
    ) |> factor(c("Low","Mid","High"), ordered = TRUE),

    # Age bands
    age_band = cut(age, c(18,35,50,65,Inf), right = FALSE,
                   labels = c("18–34","35–49","50–64","65+"))
  )

has_col <- function(data, nm) (nm %in% names(data)) && any(!is.na(data[[nm]]))

tile_df <- function(data, x_band, y_band, value, type = c("binary","mean","median")) {
  type <- match.arg(type)
  x_band <- rlang::ensym(x_band)
  y_band <- rlang::ensym(y_band)
  value  <- rlang::enquo(value)

  data %>%
    filter(!is.na(!!x_band), !is.na(!!y_band)) %>%
    group_by(!!y_band, !!x_band) %>%
    summarise(
      n = sum(!is.na(!!value)),
      pct_yes = if (type == "binary") 100 * mean((!!value) == 1, na.rm = TRUE) else NA_real_,
      mean_val = if (type == "mean") mean((!!value), na.rm = TRUE) else NA_real_,
      med_val  = if (type == "median") median((!!value), na.rm = TRUE) else NA_real_,
      .groups = "drop"
    ) %>%
    mutate(
      fill_val = dplyr::coalesce(pct_yes, mean_val, med_val),
      label = case_when(
        !is.na(pct_yes) ~ paste0(round(pct_yes), "%"),
        !is.na(mean_val) ~ sprintf("%.1f", mean_val),
        !is.na(med_val)  ~ sprintf("%.1f", med_val),
        TRUE ~ ""
      )
    ) %>%
    rename(y = !!y_band, x = !!x_band)
}

tile_plot <- function(tile_dat, title, fill_label = "% Yes", limits = c(0,100)) {
  ggplot(tile_dat, aes(x, y, fill = fill_val)) +
    geom_tile(color = "white") +
    geom_text(aes(label = label), size = 3.5, fontface = "bold", color = "white") +
    scale_fill_gradient(
      low = "#eff3ff", high = "#084594",
      limits = limits, oob = scales::squish, name = fill_label
    ) +
    labs(title = title, x = "Diet quality (tertiles)", y = "Stratifier (tertiles/bands)") +
    theme_minimal(base_size = 12) +
    theme(
      panel.grid = element_blank(),
      axis.title.y = element_text(margin = margin(r = 6)),
      axis.title.x = element_text(margin = margin(t = 6)),
      legend.position = "right",
      plot.title = element_text(face = "bold")
    )
}

Data Analysis and Visualization

After data preparation, I start from a practical question: when does better diet quality actually translate into better care and cardiometabolic control, and when is it overshadowed by other forces? Diet is a modifiable behavior, but people do not eat in a vacuum—other behaviors (smoking, sleep, activity, alcohol, depression) and structural constraints (income and access) travel with diet and may blunt or amplify its effects. Understanding how diet quality operates within these broader contexts is crucial for deciding whether to prioritize diet counseling alone, multi-behavior interventions, or structural policies.

Here I focus on the interaction between diet quality and two “anchor” contexts: behavior load (BLI, a 0–6 index of co-occurring risk behaviors) and income (PIR tertiles). My working hypotheses are:
1. Within a given behavior or income stratum, higher diet quality should be associated with better prevention and control (lower A1c ≥ 7%, less uncontrolled BP, at least as good access).
2. At the same time, large gaps across BLI or PIR at a fixed diet tertile would suggest that behavior clustering and socioeconomic position dominate over diet alone.

6. Figure A: Diet quality × Behavior Load and × Income tiles

To probe these hypotheses, Figure A uses 3×3 heatmaps (diet tertiles × BLI or PIR) to summarize, for each subgroup, the prevalence of six care/outcome measures:

Each cell shows the percent of adults in that diet–context combination with the outcome, with darker tiles indicating higher prevalence. The design choice of tertiles balances resolution with stable cell sizes, and the outcome set spans the care pathway (access → prevention → diagnosis → control). Reading across each row asks “within a given BLI or income level, does moving from low to high diet quality materially change risk?”; reading down each column asks “for similar diet quality, how much do behavior load and income still shape who gets diagnosed and controlled?” The code below constructs these 3×3 tables and renders the tiles.

# helpers: pull per-day kcal, saturated fat, and fiber columns

diet_kcal_cols <- names(df1)[stringr::str_detect(names(df1), "^DR[12].*KCAL")]
diet_sfat_cols <- names(df1)[stringr::str_detect(names(df1), "^DR[12].*SFAT")]
diet_fibe_cols <- names(df1)[stringr::str_detect(names(df1), "^DR[12].*(FIBE|DFIB)")]

row_mean_numeric <- function(data, cols){
if (!length(cols)) return(rep(NA_real_, nrow(data)))
rowMeans(
lapply(cols, function(cn) suppressWarnings(as.numeric(data[[cn]]))) |> as.data.frame(),
na.rm = TRUE
)
}
rescale01 <- function(x){
r <- range(x, na.rm = TRUE)
if (!all(is.finite(r)) || diff(r) == 0) return(rep(NA_real_, length(x)))
(x - r[1]) / diff(r)
}

# recompute diet_kcal / sat fat / fiber and a combined diet_score

df1 <- df1 %>%
mutate(
diet_kcal = row_mean_numeric(cur_data(), diet_kcal_cols),
diet_sfat = row_mean_numeric(cur_data(), diet_sfat_cols),
diet_fibe = row_mean_numeric(cur_data(), diet_fibe_cols),
satfat_dens = diet_sfat / pmax(diet_kcal, 1),
fiber_dens  = diet_fibe / pmax(diet_kcal, 1),
diet_score  = rescale01(-satfat_dens) + rescale01(fiber_dens)
)

# cut diet_score into three ordered groups: Low / Mid / High

qs_diet <- quantile(df1$diet_score, c(0, 1/3, 2/3, 1), na.rm = TRUE)
df1$diet_q3 <- cut(
df1$diet_score,
qs_diet,
include.lowest = TRUE,
labels = c("Low diet","Mid diet","High diet")
)

Next I define a 5-level BLI band (bli5) if it does not already exist, and a 3-level PIR band (pir3). These are the y-axis stratifiers for the two panels of Figure A.

# BLI quintiles (if not already present in df1)

if (!("bli_q" %in% names(df1)) || all(is.na(df1$bli_q))) {
smk_now    <- if ("smk_now"    %in% names(df1)) df1$smk_now    else 0L
heavy_alc  <- if ("heavy_alc"  %in% names(df1)) df1$heavy_alc  else 0L
sleep_flag <- if ("sleep_flag" %in% names(df1)) df1$sleep_flag else 0L
low_pa     <- if ("low_pa"     %in% names(df1)) df1$low_pa     else 0L
dep_flag   <- if ("dep_flag"   %in% names(df1)) df1$dep_flag   else 0L

# low diet component: lowest third of diet_score if needed

qd <- suppressWarnings(quantile(df1$diet_score, 1/3, na.rm = TRUE))
low_diet  <- if ("low_diet" %in% names(df1)) {
df1$low_diet
} else {
as.integer(!is.na(df1$diet_score) & df1$diet_score <= qd)
}

df1$bli_raw <- rowSums(
cbind(smk_now, heavy_alc, sleep_flag, low_pa, dep_flag, low_diet),
na.rm = TRUE
)
qs_bli <- quantile(df1$bli_raw, seq(0, 1, 0.2), na.rm = TRUE)
df1$bli5 <- cut(
df1$bli_raw,
qs_bli,
include.lowest = TRUE,
labels = c("BLI Q1 (lowest)","Q2","Q3","Q4","BLI Q5 (highest)")
)
} else {

# if bli_q already exists, just reuse it as bli5

df1$bli5 <- df1$bli_q
}
df1$bli5 <- forcats::fct_inorder(df1$bli5)

# reorder levels so "highest burden" appears at the top of the tile grid

bli_levels <- c("BLI Q5 (highest)","Q4","Q3","Q2","BLI Q1 (lowest)")
df1$bli5 <- factor(df1$bli5, levels = bli_levels)

# PIR tertiles: either reuse pir_grp or cut numeric PIR directly

if ("pir_grp" %in% names(df1) && !all(is.na(df1$pir_grp))) {
df1$pir3 <- factor(df1$pir_grp, levels = c("Low PIR","Mid PIR","High PIR"))
} else {
qs_pir <- quantile(df1$pir, c(0, 1/3, 2/3, 1), na.rm = TRUE)
df1$pir3 <- cut(
df1$pir,
qs_pir,
include.lowest = TRUE,
labels = c("Low PIR","Mid PIR","High PIR")
)
}

With diet_q3, bli5, and pir3 defined, I now create a helper make_2way() that:

restricts to complete cases of the x and y banding variables,

defines binary outcome flags for each care/outcome measure,

computes the cell size n and mean prevalence for each outcome,

fills in any missing tile combinations so the grid is always 3×3.

I then build two summary tables: TAB_DIET_BLI (diet × BLI) and TAB_DIET_PIR (diet × PIR).

make_2way <- function(data, xvar, yvar){
data %>%
filter(!is.na(.data[[xvar]]), !is.na(.data[[yvar]])) %>%
mutate(
access = as.integer(access %in% 1),
prev_bt = as.integer(btest  %in% 1),
dx_htn  = as.integer(dx_htn %in% 1),
dx_dm   = as.integer(dx_dm  %in% 1),
a1c_hi  = as.integer(!is.na(a1c) & a1c >= 7),
bp_uncontrolled = as.integer(
!is.na(sbp) & !is.na(dbp) & (sbp >= 140 | dbp >= 90)
)
) %>%
group_by(.data[[yvar]], .data[[xvar]]) %>%
summarise(
n         = dplyr::n(),
p_access  = mean(access,          na.rm = TRUE),
p_prev_bt = mean(prev_bt,         na.rm = TRUE),
p_htn     = mean(dx_htn,          na.rm = TRUE),
p_dm      = mean(dx_dm,           na.rm = TRUE),
p_bp_unc  = mean(bp_uncontrolled, na.rm = TRUE),
p_a1c7    = mean(a1c_hi,          na.rm = TRUE),
.groups   = "drop"
) %>%
# standardize x/y column names for plotting
rename(y = 1, x = 2) %>%
mutate(
x = forcats::fct_relevel(x, "Low diet","Mid diet","High diet"),
y = if (identical(yvar, "bli5")) {
factor(y, levels = bli_levels)
} else {
forcats::fct_inorder(y)
}
) %>%
# explicitly complete the 3×3 grid so tiles are never missing
tidyr::complete(
x = factor(
c("Low diet","Mid diet","High diet"),
levels = c("Low diet","Mid diet","High diet")
),
y = if (identical(yvar, "bli5")) {
factor(bli_levels, levels = bli_levels)
} else {
y
},
fill = list(
n         = 0L,
p_access  = NA_real_,
p_prev_bt = NA_real_,
p_htn     = NA_real_,
p_dm      = NA_real_,
p_bp_unc  = NA_real_,
p_a1c7    = NA_real_
)
)
}

# Construct the two 3×3 tables used in Figure A

TAB_DIET_BLI <- make_2way(df1, "diet_q3", "bli5")
TAB_DIET_PIR <- make_2way(df1, "diet_q3", "pir3")

Finally, I define a plotting helper mk_tile() that converts one of these tables into a 6-panel layout (one panel per outcome) and a wrapper panel_from_tab() that arranges the panels using patchwork. The panel_diet_bli and panel_diet_pir objects are the final Figure A panels.

library(patchwork)

palette_low  <- "#EEF2F7"
palette_high <- "#1F3A8A"
fmt_pct <- scales::percent_format(accuracy = 1)

mk_tile <- function(dat, value, title){
dd <- dat %>%
mutate(
val = .data[[value]],
lab = ifelse(is.na(val), "", fmt_pct(val))
)

# row-level sample size labels (n=…) on the right margin

row_lab <- dd %>%
group_by(y) %>%
summarise(nn = sum(n, na.rm = TRUE), .groups = "drop")

ggplot(dd, aes(x = x, y = y, fill = val)) +
geom_tile(color = "white", linewidth = 0.6) +
geom_text(
aes(label = lab),
size = 3.2, fontface = "bold", color = "white"
) +
geom_text(
data = row_lab,
aes(x = 3.55, y = y, label = paste0("n=", scales::comma(nn))),
inherit.aes = FALSE,
hjust = 0, vjust = 0.5,
size = 3.1, color = "grey20"
) +
scale_fill_gradient(
limits   = c(0, 1),
low      = palette_low,
high     = palette_high,
na.value = "grey85",
name     = "% Yes",
labels   = scales::percent_format(accuracy = 1)
) +
labs(
x     = "Diet quality (tertiles)",
y     = NULL,
title = title
) +
coord_cartesian(xlim = c(1, 3.9), clip = "off") +
theme_minimal(base_size = 12) +
theme(
panel.grid   = element_blank(),
plot.title   = element_text(face = "bold", hjust = 0.5),
axis.title.y = element_text(margin = margin(r = 6)),
plot.margin  = margin(t = 6, r = 36, b = 6, l = 6)
)
}

panel_from_tab <- function(tab, big_title){
p1 <- mk_tile(tab, "p_access",  "Access (seen clinician)")
p2 <- mk_tile(tab, "p_prev_bt", "Prevention (blood test 12m)")
p3 <- mk_tile(tab, "p_htn",     "Dx: Hypertension (ever told)")
p4 <- mk_tile(tab, "p_dm",      "Dx: Diabetes (ever told)")
p5 <- mk_tile(tab, "p_bp_unc",  "Uncontrolled BP (SBP≥140 / DBP≥90)")
p6 <- mk_tile(tab, "p_a1c7",    "A1c ≥ 7%")

(p1 | p2 | p3) /
(p4 | p5 | p6) +
plot_annotation(
title = big_title,
theme = theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5)
)
)
}

panel_diet_bli <- panel_from_tab(
TAB_DIET_BLI,
"Diet Quality × Behavior Load (BLI) outcome tiles"
)
panel_diet_pir <- panel_from_tab(
TAB_DIET_PIR,
"Diet Quality × Income (PIR) outcome tiles"
)

# Optionally print in the knitted document:

print(panel_diet_bli)

print(panel_diet_pir)

Numeric Summaries: To complement the heatmaps, I tabulate the underlying cell counts and prevalence, then compute simple contrasts between low and high diet quality within each behavior load (or income) stratum. I also run an ordinal trend test treating diet tertile as a 1–3 score.

# 10.1 Numeric summaries for Figure A tiles 

# Ensure global binary flags exist
df1 <- df1 %>%
  mutate(
    bp_uncontrolled = as.integer(!is.na(sbp) & !is.na(dbp) & (sbp >= 140 | dbp >= 90)),
    a1c_hi         = as.integer(!is.na(a1c) & a1c >= 7)
  )

# Helper to make a long cell table from a TAB_* object
make_figA_cells <- function(tab, strat_label) {
  tab %>%
    dplyr::select(y, x, n, dplyr::starts_with("p_")) %>%
    tidyr::pivot_longer(
      cols      = dplyr::starts_with("p_"),
      names_to  = "outcome_short",
      values_to = "prop"
    ) %>%
    dplyr::mutate(
      outcome = dplyr::recode(
        outcome_short,
        p_access  = "Access (seen clinician, 12m)",
        p_prev_bt = "Prevention (blood test, 12m)",
        p_htn     = "Dx: Hypertension (ever told)",
        p_dm      = "Dx: Diabetes (ever told)",
        p_bp_unc  = "Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90)",
        p_a1c7    = "A1c ≥ 7%"
      ),
      pct     = 100 * prop,
      n_yes   = round(prop * n),
      diet    = x,
      strata  = y,
      strat_type = strat_label
    ) %>%
    dplyr::select(strat_type, strata, diet, outcome, n, n_yes, pct) %>%
    dplyr::arrange(strat_type, outcome, strata, diet)
}

figA_cells_bli <- make_figA_cells(TAB_DIET_BLI, "BLI quintile")
figA_cells_pir <- make_figA_cells(TAB_DIET_PIR, "PIR tertile")

# Example: inspect a subset of the table in the knitted doc
knitr::kable(
  dplyr::slice_head(figA_cells_bli, n = 18),
  digits  = 1,
  caption = "Selected cells from Figure A (Diet × BLI): n and % with each outcome."
)
Selected cells from Figure A (Diet × BLI): n and % with each outcome.
strat_type strata diet outcome n n_yes pct
BLI quintile BLI Q5 (highest) Low diet A1c ≥ 7% 1246 120 9.6
BLI quintile BLI Q5 (highest) Mid diet A1c ≥ 7% 368 26 7.1
BLI quintile BLI Q5 (highest) High diet A1c ≥ 7% 296 20 6.8
BLI quintile Q4 Low diet A1c ≥ 7% 804 30 3.7
BLI quintile Q4 Mid diet A1c ≥ 7% 406 28 6.9
BLI quintile Q4 High diet A1c ≥ 7% 384 42 10.9
BLI quintile Q3 Low diet A1c ≥ 7% 182 7 3.8
BLI quintile Q3 Mid diet A1c ≥ 7% 475 22 4.6
BLI quintile Q3 High diet A1c ≥ 7% 495 19 3.8
BLI quintile Q2 Low diet A1c ≥ 7% 0 NA NA
BLI quintile Q2 Mid diet A1c ≥ 7% 501 8 1.6
BLI quintile Q2 High diet A1c ≥ 7% 523 13 2.5
BLI quintile BLI Q1 (lowest) Low diet A1c ≥ 7% 0 NA NA
BLI quintile BLI Q1 (lowest) Mid diet A1c ≥ 7% 482 17 3.5
BLI quintile BLI Q1 (lowest) High diet A1c ≥ 7% 534 22 4.1
BLI quintile BLI Q5 (highest) Low diet Access (seen clinician, 12m) 1246 214 17.2
BLI quintile BLI Q5 (highest) Mid diet Access (seen clinician, 12m) 368 92 25.0
BLI quintile BLI Q5 (highest) High diet Access (seen clinician, 12m) 296 62 20.9
knitr::kable(
  dplyr::slice_head(figA_cells_pir, n = 18),
  digits  = 1,
  caption = "Selected cells from Figure A (Diet × PIR): n and % with each outcome."
)
Selected cells from Figure A (Diet × PIR): n and % with each outcome.
strat_type strata diet outcome n n_yes pct
PIR tertile Low PIR Low diet A1c ≥ 7% 641 53 8.3
PIR tertile Low PIR Mid diet A1c ≥ 7% 604 29 4.8
PIR tertile Low PIR High diet A1c ≥ 7% 563 47 8.3
PIR tertile Mid PIR Low diet A1c ≥ 7% 726 59 8.1
PIR tertile Mid PIR Mid diet A1c ≥ 7% 677 31 4.6
PIR tertile Mid PIR High diet A1c ≥ 7% 588 30 5.1
PIR tertile High PIR Low diet A1c ≥ 7% 615 32 5.2
PIR tertile High PIR Mid diet A1c ≥ 7% 691 26 3.8
PIR tertile High PIR High diet A1c ≥ 7% 812 22 2.7
PIR tertile Low PIR Low diet Access (seen clinician, 12m) 641 109 17.0
PIR tertile Low PIR Mid diet Access (seen clinician, 12m) 604 118 19.5
PIR tertile Low PIR High diet Access (seen clinician, 12m) 563 88 15.6
PIR tertile Mid PIR Low diet Access (seen clinician, 12m) 726 93 12.8
PIR tertile Mid PIR Mid diet Access (seen clinician, 12m) 677 91 13.4
PIR tertile Mid PIR High diet Access (seen clinician, 12m) 588 72 12.2
PIR tertile High PIR Low diet Access (seen clinician, 12m) 615 101 16.4
PIR tertile High PIR Mid diet Access (seen clinician, 12m) 691 116 16.8
PIR tertile High PIR High diet Access (seen clinician, 12m) 812 119 14.7

Interpretations: In the BLI heatmap cells, adults in the worst behavior-load group (BLI Q5) show a clear A1c gradient by diet—A1c ≥ 7% falls from ~9.6% in low diet to ~7% in mid/high diet—while access to care within this high-risk group is actually highest in the mid-diet tertile (~25% vs ~17–21%). Across PIR strata, mid-diet groups generally have the lowest A1c ≥ 7% prevalence (e.g., 4.6–4.8% vs ~8% in low diet), and in high-income adults the combination of high PIR and high diet quality yields the lowest A1c risk (~2.7%) with only modest differences in recent clinician contact across diet tertiles (roughly 12–20%).

Within each BLI quintile (or PIR tertile), I then compare outcome prevalence in the high versus low diet tertiles. This puts a numeric effect size under each pair of tiles in the heatmaps.

# Contrast High vs Low diet within each stratum --------------------------

contrast_high_low <- function(cells_tbl) {
  cells_tbl %>%
    dplyr::filter(diet %in% c("Low diet","High diet")) %>%
    dplyr::select(strat_type, strata, outcome, diet, pct) %>%
    tidyr::pivot_wider(names_from = diet, values_from = pct) %>%
    dplyr::mutate(
      diff_high_minus_low = `High diet` - `Low diet`
    ) %>%
    dplyr::arrange(strat_type, outcome, strata)
}

diet_contrast_bli <- contrast_high_low(figA_cells_bli)
diet_contrast_pir <- contrast_high_low(figA_cells_pir)

knitr::kable(
  diet_contrast_bli,
  digits  = 1,
  caption = "Figure A (Diet × BLI): difference in prevalence (High – Low diet) within each BLI quintile."
)
Figure A (Diet × BLI): difference in prevalence (High – Low diet) within each BLI quintile.
strat_type strata outcome Low diet High diet diff_high_minus_low
BLI quintile BLI Q5 (highest) A1c ≥ 7% 9.6 6.8 -2.9
BLI quintile Q4 A1c ≥ 7% 3.7 10.9 7.2
BLI quintile Q3 A1c ≥ 7% 3.8 3.8 0.0
BLI quintile Q2 A1c ≥ 7% NA 2.5 NA
BLI quintile BLI Q1 (lowest) A1c ≥ 7% NA 4.1 NA
BLI quintile BLI Q5 (highest) Access (seen clinician, 12m) 17.2 20.9 3.8
BLI quintile Q4 Access (seen clinician, 12m) 12.4 17.4 5.0
BLI quintile Q3 Access (seen clinician, 12m) 12.1 14.3 2.3
BLI quintile Q2 Access (seen clinician, 12m) NA 11.5 NA
BLI quintile BLI Q1 (lowest) Access (seen clinician, 12m) NA 9.9 NA
BLI quintile BLI Q5 (highest) Dx: Diabetes (ever told) 16.7 17.2 0.5
BLI quintile Q4 Dx: Diabetes (ever told) 7.7 15.6 7.9
BLI quintile Q3 Dx: Diabetes (ever told) 8.8 6.3 -2.5
BLI quintile Q2 Dx: Diabetes (ever told) NA 5.5 NA
BLI quintile BLI Q1 (lowest) Dx: Diabetes (ever told) NA 6.6 NA
BLI quintile BLI Q5 (highest) Dx: Hypertension (ever told) 39.5 42.2 2.7
BLI quintile Q4 Dx: Hypertension (ever told) 19.8 40.1 20.3
BLI quintile Q3 Dx: Hypertension (ever told) 18.7 24.8 6.2
BLI quintile Q2 Dx: Hypertension (ever told) NA 19.7 NA
BLI quintile BLI Q1 (lowest) Dx: Hypertension (ever told) NA 18.4 NA
BLI quintile BLI Q5 (highest) Prevention (blood test, 12m) 35.3 42.2 6.9
BLI quintile Q4 Prevention (blood test, 12m) 24.1 40.1 16.0
BLI quintile Q3 Prevention (blood test, 12m) 22.5 33.5 11.0
BLI quintile Q2 Prevention (blood test, 12m) NA 33.3 NA
BLI quintile BLI Q1 (lowest) Prevention (blood test, 12m) NA 31.5 NA
BLI quintile BLI Q5 (highest) Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) 17.0 16.6 -0.5
BLI quintile Q4 Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) 10.0 18.5 8.5
BLI quintile Q3 Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) 8.8 11.7 2.9
BLI quintile Q2 Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) NA 11.7 NA
BLI quintile BLI Q1 (lowest) Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) NA 11.2 NA
knitr::kable(
  diet_contrast_pir,
  digits  = 1,
  caption = "Figure A (Diet × PIR): difference in prevalence (High – Low diet) within each PIR tertile."
)
Figure A (Diet × PIR): difference in prevalence (High – Low diet) within each PIR tertile.
strat_type strata outcome Low diet High diet diff_high_minus_low
PIR tertile Low PIR A1c ≥ 7% 8.3 8.3 0.1
PIR tertile Mid PIR A1c ≥ 7% 8.1 5.1 -3.0
PIR tertile High PIR A1c ≥ 7% 5.2 2.7 -2.5
PIR tertile Low PIR Access (seen clinician, 12m) 17.0 15.6 -1.4
PIR tertile Mid PIR Access (seen clinician, 12m) 12.8 12.2 -0.6
PIR tertile High PIR Access (seen clinician, 12m) 16.4 14.7 -1.8
PIR tertile Low PIR Dx: Diabetes (ever told) 14.5 13.7 -0.8
PIR tertile Mid PIR Dx: Diabetes (ever told) 13.2 9.4 -3.9
PIR tertile High PIR Dx: Diabetes (ever told) 11.4 5.4 -6.0
PIR tertile Low PIR Dx: Hypertension (ever told) 29.5 29.5 0.0
PIR tertile Mid PIR Dx: Hypertension (ever told) 32.0 27.6 -4.4
PIR tertile High PIR Dx: Hypertension (ever told) 30.2 24.5 -5.7
PIR tertile Low PIR Prevention (blood test, 12m) 22.5 26.6 4.2
PIR tertile Mid PIR Prevention (blood test, 12m) 29.2 33.8 4.6
PIR tertile High PIR Prevention (blood test, 12m) 40.2 44.2 4.0
PIR tertile Low PIR Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) 12.9 16.5 3.6
PIR tertile Mid PIR Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) 15.2 13.6 -1.5
PIR tertile High PIR Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) 13.2 11.2 -2.0

In the BLI heatmap cells, adults in the worst behavior-load group (BLI Q5) show a clear A1c gradient by diet—A1c ≥ 7% falls from ~9.6% in low diet to ~7% in mid/high diet—while access to care within this high-risk group is actually highest in the mid-diet tertile (~25% vs ~17–21%). Across PIR strata, mid-diet groups generally have the lowest A1c ≥ 7% prevalence (e.g., 4.6–4.8% vs ~8% in low diet), and in high-income adults the combination of high PIR and high diet quality yields the lowest A1c risk (~2.7%) with only modest differences in recent clinician contact across diet tertiles (roughly 12–20%).

Figure A (continued): Diet × Age/Sex/Race tiles

To further examine heterogeneity, I replicate the tile structure for diet quality × age, × sex, and × race/ethnicity. The logic is: Compress continuous or coded variables into a small number of interpretable groups (3 age bands, 2 sex levels, 5+ race/ethnicity categories). Reuse the same make_2way() + panel_from_tab() machinery from Figure A to compute cell-wise prevalences. Generate three separate panels: diet × age, diet × sex, and diet × race/ethnicity.

# --- 7.1 Age bands: compress RIDAGEYR into three adult age groups ---

df1 <- df1 %>%
mutate(
age3 = case_when(
!is.na(RIDAGEYR) & RIDAGEYR >= 18  & RIDAGEYR < 35 ~ "Younger (18–34)",
!is.na(RIDAGEYR) & RIDAGEYR >= 35  & RIDAGEYR < 65 ~ "Middle (35–64)",
!is.na(RIDAGEYR) & RIDAGEYR >= 65                  ~ "Older (65+)"
),
age3 = factor(
age3,
levels = c("Younger (18–34)","Middle (35–64)","Older (65+)")
)
)

# Build the diet × age tile table and enforce a consistent factor order

TAB_DIET_AGE <- make_2way(df1, "diet_q3", "age3") %>%
mutate(
x = forcats::fct_relevel(x, "Low diet","Mid diet","High diet"),
y = forcats::fct_relevel(y, "Younger (18–34)","Middle (35–64)","Older (65+)")
) %>%

# make sure all combinations of age band × diet tercile are present

tidyr::complete(y = levels(y), x = levels(x)) %>%
arrange(y, x) %>%
group_by(y) %>%
mutate(
# optional: per-row sample size, if needed later for annotation
n_row = sum(replace_na(n, 0))
) %>%
ungroup()

panel_diet_age <- panel_from_tab(
TAB_DIET_AGE,
"Diet Quality × Age outcome tiles"
)

# --- 7.2 Sex: simple Female / Male factor for tile plots ---

df1$sex2 <- factor(df1$sex, levels = c("Female","Male"))

# --- 7.3 Race/ethnicity: collapse RIDRETH2/3 into a 5+ group variable ---

# Identify which RIDRETH* column is present, preferring vars$race if defined

cand_cols <- c("RIDRETH3","RIDRETH2","RIDRETH1")
if (exists("vars") && !is.null(vars$race) && !is.na(vars$race)) {
cand_cols <- c(vars$race, cand_cols)
}
rid_col <- intersect(cand_cols, names(df1))
rid_col <- if (length(rid_col)) rid_col[1] else NA_character_

# Read the numeric race code from the chosen RIDRETH* column

race_code <- rep(NA_real_, nrow(df1))
if (!is.na(rid_col)) {
race_code <- suppressWarnings(as.numeric(df1[[rid_col]]))
}

# Map NHANES race/ethnicity codes into human-readable labels

race_lbl <- dplyr::case_when(
race_code == 1 ~ "Mexican American",
race_code == 2 ~ "Other Hispanic",
race_code == 3 ~ "NH White",
race_code == 4 ~ "NH Black",
race_code == 6 ~ "NH Asian",
race_code == 7 ~ "Other/Multiracial",
TRUE ~ NA_character_
)

# Store as a factor with a consistent ordering for tiles

df1$race5 <- factor(
race_lbl,
levels = c(
"NH White","NH Black","NH Asian",
"Mexican American","Other Hispanic","Other/Multiracial"
)
)

# --- 7.4 Build diet × Age/Sex/Race tables and panels using the same machinery as Figure A ---

TAB_DIET_AGE  <- make_2way(df1, "diet_q3", "age3")
TAB_DIET_SEX  <- make_2way(df1, "diet_q3", "sex2")
TAB_DIET_RACE <- make_2way(df1, "diet_q3", "race5")

panel_diet_age  <- panel_from_tab(TAB_DIET_AGE,  "Diet Quality × Age outcome tiles")
panel_diet_sex  <- panel_from_tab(TAB_DIET_SEX,  "Diet Quality × Sex outcome tiles")
panel_diet_race <- panel_from_tab(TAB_DIET_RACE, "Diet Quality × Race/Ethnicity outcome tiles")

# Print the three demographic panels

print(panel_diet_age)

print(panel_diet_sex)

print(panel_diet_race)

## Selected cells from Figure A (Diet × Age): n and % with each outcome
TAB_AGE_A1C <- TAB_DIET_AGE %>%
  transmute(
    strat_type = "Age band",
    strata     = as.character(y),
    diet       = as.character(x),
    outcome    = "A1c ≥ 7%",
    n          = n,
    n_yes      = round(p_a1c7 * n),
    pct        = round(100 * p_a1c7, 1)
  )

TAB_AGE_ACCESS <- TAB_DIET_AGE %>%
  transmute(
    strat_type = "Age band",
    strata     = as.character(y),
    diet       = as.character(x),
    outcome    = "Access (seen clinician, 12m)",
    n          = n,
    n_yes      = round(p_access * n),
    pct        = round(100 * p_access, 1)
  )

TAB_DIET_AGE_NUM <- bind_rows(TAB_AGE_A1C, TAB_AGE_ACCESS)
print(knitr::kable(TAB_DIET_AGE_NUM))
## 
## 
## |strat_type |strata          |diet      |outcome                      |   n| n_yes|  pct|
## |:----------|:---------------|:---------|:----------------------------|---:|-----:|----:|
## |Age band   |Younger (18–34) |Low diet  |A1c ≥ 7%                     | 333|     6|  1.8|
## |Age band   |Middle (35–64)  |Low diet  |A1c ≥ 7%                     | 752|    75| 10.0|
## |Age band   |Older (65+)     |Low diet  |A1c ≥ 7%                     | 560|    76| 13.6|
## |Age band   |Younger (18–34) |Mid diet  |A1c ≥ 7%                     | 358|     6|  1.7|
## |Age band   |Middle (35–64)  |Mid diet  |A1c ≥ 7%                     | 722|    51|  7.1|
## |Age band   |Older (65+)     |Mid diet  |A1c ≥ 7%                     | 517|    44|  8.5|
## |Age band   |Younger (18–34) |High diet |A1c ≥ 7%                     | 298|     1|  0.3|
## |Age band   |Middle (35–64)  |High diet |A1c ≥ 7%                     | 819|    69|  8.4|
## |Age band   |Older (65+)     |High diet |A1c ≥ 7%                     | 627|    46|  7.3|
## |Age band   |Younger (18–34) |Low diet  |Access (seen clinician, 12m) | 333|    81| 24.3|
## |Age band   |Middle (35–64)  |Low diet  |Access (seen clinician, 12m) | 752|   136| 18.1|
## |Age band   |Older (65+)     |Low diet  |Access (seen clinician, 12m) | 560|    40|  7.1|
## |Age band   |Younger (18–34) |Mid diet  |Access (seen clinician, 12m) | 358|    82| 22.9|
## |Age band   |Middle (35–64)  |Mid diet  |Access (seen clinician, 12m) | 722|   142| 19.7|
## |Age band   |Older (65+)     |Mid diet  |Access (seen clinician, 12m) | 517|    42|  8.1|
## |Age band   |Younger (18–34) |High diet |Access (seen clinician, 12m) | 298|    77| 25.8|
## |Age band   |Middle (35–64)  |High diet |Access (seen clinician, 12m) | 819|   133| 16.2|
## |Age band   |Older (65+)     |High diet |Access (seen clinician, 12m) | 627|    56|  8.9|
## Figure A (Diet × Age): difference in prevalence (High – Low diet) within each age band
AGE_DIFF <- TAB_DIET_AGE %>%
  select(y, x, p_a1c7, p_access) %>%
  mutate(
    a1c_pct    = 100 * p_a1c7,
    access_pct = 100 * p_access
  ) %>%
  select(-p_a1c7, -p_access) %>%
  pivot_longer(
    cols = c(a1c_pct, access_pct),
    names_to = "metric", values_to = "pct"
  ) %>%
  filter(x %in% c("Low diet","High diet")) %>%
  mutate(diet = x) %>%
  select(-x) %>%
  pivot_wider(names_from = diet, values_from = pct) %>%
  mutate(
    diff_high_minus_low = `High diet` - `Low diet`,
    strat_type = "Age band",
    outcome    = recode(metric,
                        a1c_pct    = "A1c ≥ 7%",
                        access_pct = "Access (seen clinician, 12m)")
  ) %>%
  select(strat_type, strata = y, outcome,
         `Low diet`, `High diet`, diff_high_minus_low)

print(knitr::kable(AGE_DIFF, digits = 1))
## 
## 
## |strat_type |strata          |outcome                      | Low diet| High diet| diff_high_minus_low|
## |:----------|:---------------|:----------------------------|--------:|---------:|-------------------:|
## |Age band   |Younger (18–34) |A1c ≥ 7%                     |      1.8|       0.3|                -1.5|
## |Age band   |Younger (18–34) |Access (seen clinician, 12m) |     24.3|      25.8|                 1.5|
## |Age band   |Middle (35–64)  |A1c ≥ 7%                     |     10.0|       8.4|                -1.5|
## |Age band   |Middle (35–64)  |Access (seen clinician, 12m) |     18.1|      16.2|                -1.8|
## |Age band   |Older (65+)     |A1c ≥ 7%                     |     13.6|       7.3|                -6.2|
## |Age band   |Older (65+)     |Access (seen clinician, 12m) |      7.1|       8.9|                 1.8|

Within age bands, the tiles and difference table agree that age dominates the gradients, with diet adding only small corrections. For A1c≥7%, prevalence rises sharply from younger to older adults (e.g., 2–3% in 18–34 vs ~10–14% in 35–64 and 65+), while moving from low to high diet quality mostly nudges rates down by 1–2 percentage points, and more noticeably in the oldest group (13.6%→7.3%, –6.2 pp). Access shows the opposite pattern: younger adults have much higher visit rates overall (≈24–26%) than older adults (≈7–9%), and within each age band the low– vs high-diet differences are tiny (–1.8 to +1.8 pp), mirroring the almost flat color tiles across diet tertiles.

Summary: Putting BLI, PIR, and age together, Figure A plus these tables suggest that behavior load, income, and age are the primary axes of variation, while diet tertiles behave more like a modest modifier than a dominant driver. High-diet groups tend to have slightly better access and somewhat lower A1c/diabetes risk in specific strata (especially high-BLI and older adults), but the tiles and numbers both show that these effects are small compared with the large between-strata gaps—supporting the idea that diet quality is partly tagging broader patterns of healthcare contact and underlying risk rather than cleanly separating “healthy” and “unhealthy” cardiometabolic profiles on its own.

Panel B: Diet × Income tiles stratified by age

In Panel A we saw that income and overall behavior load largely set the baseline for diagnosis and control, with diet tertiles adding only modest corrections. One obvious follow-up is whether these patterns are stable across the life course, especially for access and preventive blood testing, where age strongly shapes insurance coverage, screening guidelines, and chances to interact with the health system.

Panel B therefore repeats the Diet × PIR tiles separately for younger (18–34), middle-aged (35–64), and older (65+) adults. Within each age band we look at the same outcomes as in Panel A, but we pay particular attention to the Access (seen clinician, 12m) and Prevention (blood test, 12m) tiles, asking whether age modifies the income gradient or the added effect of diet quality on getting into care and being screened.

make_diet_pir_tab <- function(data_subset){
  make_2way(data_subset, "diet_q3", "pir3") %>%
    mutate(
      x = forcats::fct_relevel(x, "Low diet","Mid diet","High diet"),
      y = forcats::fct_relevel(y, "Low PIR","Mid PIR","High PIR")
    ) %>%
    tidyr::complete(y = levels(y), x = levels(x)) %>%
    arrange(y, x) %>%
    group_by(y) %>%
    mutate(n_row = sum(replace_na(n, 0))) %>%
    ungroup()
}

panel_diet_pir_by_age <- function(age_label){
  dsub <- df1 %>% filter(age3 == age_label)
  tab  <- make_diet_pir_tab(dsub)
  panel_from_tab(tab, paste0("Diet Quality × Income (PIR) outcome tiles — ", age_label))
}

p_age_18_34 <- panel_diet_pir_by_age("Younger (18–34)")
p_age_35_64 <- panel_diet_pir_by_age("Middle (35–64)")
p_age_65p   <- panel_diet_pir_by_age("Older (65+)")

# Optionally print:
print(p_age_18_34)

print(p_age_35_64)

print(p_age_65p)

Panel B: what we learn

Across all three age groups, the tiles confirm that income still structures access and prevention more than diet does, but the gradient changes with age. Among younger adults, access and blood testing are relatively low and noisy, with only mild PIR differences and no consistent advantage for high-diet groups, suggesting that many young adults remain outside regular care regardless of diet. In middle-aged adults, the income gradient becomes clearer: low-PIR groups show higher “access” but lower preventive testing than high-PIR peers, hinting that lower-income patients may visit clinicians for acute needs while higher-income peers are more likely to receive guideline-driven screening. By older ages, access is fairly high in all PIR bands and prevention tiles are uniformly dark, with high-PIR/high-diet cells still near the top—consistent with Medicare and age-based screening policies reducing, but not erasing, socioeconomic gaps.

Together with Panel A, Panel B suggests that structural factors (age, insurance, screening norms, income) shape who gets seen and tested, while diet quality operates more as a mild modifier within those strata rather than a primary driver of access or prevention.

Panel C Ridge plots: distributions of SBP and A1c Panels A and B focus on whether people are diagnosed, seen, or “uncontrolled” according to guideline cutoffs. But those binary outcomes sit on top of continuous risk distributions: some groups may have similar proportions of A1c ≥ 7%, for example, while still differing in their average A1c or in how heavy the high-risk tail is.

In Panel C, I therefore switch to ridge plots of SBP and A1c, stratified by the same social and behavioral axes as before (diet tertiles crossed with income or age). The question is:

Do higher diet quality scores shift the entire distribution of SBP or A1c downward within income and age strata, or do we mostly see small changes around the clinical thresholds (140/90 for BP, 7% for A1c)?

Are diet-related shifts comparable to those associated with income or age, or do the structural factors still dominate once we look at the continuous measurements?

I therefore, in Panel C, add ridge plots to visualize:

This helps distinguish whether differences are simple shifts or more complex distribution changes.

to_num <- function(x) suppressWarnings(as.numeric(x))

# If diet_ter is not already present, build tertiles of diet_score

if (!("diet_ter" %in% names(df1))) {
if (!("diet_score" %in% names(df1))) stop("diet_score missing.")
df1 <- df1 %>%
mutate(
diet_ter = ntile(diet_score, 3),
diet_ter = factor(diet_ter,
labels = c("Low diet score","Mid diet score","High diet score"))
)
}

# Ridge-plot backbone: SBP, A1c, BLI quintiles, and diet tertiles

ridg_dat <- df1 %>%
transmute(
sbp   = to_num(sbp),
a1c   = to_num(a1c),
bli_q = factor(bli_q,
levels = c("BLI Q1 (lowest)","Q2","Q3","Q4","BLI Q5 (highest)")),
diet_ter = factor(diet_ter,
levels = c("Low diet score","Mid diet score","High diet score"))
) %>%
mutate(
# Truncate to plausible clinical ranges for cleaner densities
sbp = ifelse(sbp >= 80 & sbp <= 200, sbp, NA_real_),
a1c = ifelse(a1c >= 4  & a1c <= 14,  a1c, NA_real_)
)

Next I construct four core ridge plots:

SBP by BLI quintile, A1c by BLI quintile, SBP by diet tertile, A1c by diet tertile. Each plot overlays a vertical guideline at a standard clinical threshold.

p_sbp_bli <- ggplot(
filter(ridg_dat, !is.na(sbp), !is.na(bli_q)),
aes(x = sbp, y = bli_q, fill = bli_q)
) +
ggridges::geom_density_ridges(
scale = 1.2, rel_min_height = 0.01,
alpha = 0.85, color = "grey25"
) +
geom_vline(xintercept = 140, linetype = 2, linewidth = 0.7, color = "firebrick") +
scale_fill_manual(
values = c("#e0e7ff","#c7d2fe","#a5b4fc","#818cf8","#6366f1"),
guide  = "none"
) +
labs(
title    = "SBP distributions by Behavior Load (BLI)",
subtitle = "Dashed line at 140 mmHg",
x        = "Systolic blood pressure (mmHg)",
y        = NULL
) +
theme_ridges(font_size = 12, grid = TRUE)

p_a1c_bli <- ggplot(
filter(ridg_dat, !is.na(a1c), !is.na(bli_q)),
aes(x = a1c, y = bli_q, fill = bli_q)
) +
ggridges::geom_density_ridges(
scale = 1.2, rel_min_height = 0.01,
alpha = 0.85, color = "grey25"
) +
geom_vline(xintercept = 7, linetype = 2, linewidth = 0.7, color = "firebrick") +
scale_fill_manual(
values = c("#e0f2fe","#bae6fd","#7dd3fc","#38bdf8","#0ea5e9"),
guide  = "none"
) +
labs(
title    = "A1c distributions by Behavior Load (BLI)",
subtitle = "Dashed line at 7%",
x        = "Hemoglobin A1c (%)",
y        = NULL
) +
theme_ridges(font_size = 12, grid = TRUE)

p_sbp_diet <- ggplot(
filter(ridg_dat, !is.na(sbp), !is.na(diet_ter)),
aes(x = sbp, y = diet_ter, fill = diet_ter)
) +
ggridges::geom_density_ridges(
scale = 1.2, rel_min_height = 0.01,
alpha = 0.9, color = "grey25"
) +
geom_vline(xintercept = 140, linetype = 2, linewidth = 0.7, color = "firebrick") +
scale_fill_manual(
values = c("#fee2e2","#fecaca","#fca5a5"),
guide  = "none"
) +
labs(
title    = "SBP distributions by Diet quality (tertiles)",
subtitle = "Dashed line at 140 mmHg",
x        = "Systolic blood pressure (mmHg)",
y        = NULL
) +
theme_ridges(font_size = 12, grid = TRUE)

p_a1c_diet <- ggplot(
filter(ridg_dat, !is.na(a1c), !is.na(diet_ter)),
aes(x = a1c, y = diet_ter, fill = diet_ter)
) +
ggridges::geom_density_ridges(
scale = 1.2, rel_min_height = 0.01,
alpha = 0.9, color = "grey25"
) +
geom_vline(xintercept = 7, linetype = 2, linewidth = 0.7, color = "firebrick") +
scale_fill_manual(
values = c("#dcfce7","#bbf7d0","#86efac"),
guide  = "none"
) +
labs(
title    = "A1c distributions by Diet quality (tertiles)",
subtitle = "Dashed line at 7%",
x        = "Hemoglobin A1c (%)",
y        = NULL
) +
theme_ridges(font_size = 12, grid = TRUE)

# 2×2 layout: BLI gradients on top, diet tertiles on bottom

p_ridge <- (p_sbp_bli | p_a1c_bli) / (p_sbp_diet | p_a1c_diet)

Finally, I refine factor ordering and define helper functions to build within–BLI stratum ridge panels. These panels show how SBP and A1c vary across diet tertiles within a fixed BLI quintile, making it easier to compare diet effects at a given behavior load.

# Ensure consistent ordering for BLI quintiles and diet tertiles

ridg_dat$bli_q <- fct_relevel(
ridg_dat$bli_q,
"BLI Q1 (lowest)","Q2","Q3","Q4","BLI Q5 (highest)"
)
ridg_dat$diet_ter <- fct_relevel(
ridg_dat$diet_ter,
"Low diet score","Mid diet score","High diet score"
)

# Small builders: SBP and A1c ridges by diet tertile, with flexible titles

build_sbp_plot <- function(dat, title_suffix = "") {
ggplot(dat, aes(x = sbp, y = diet_ter, fill = diet_ter)) +
ggridges::geom_density_ridges(
scale = 1.2, rel_min_height = 0.01,
alpha = 0.90, color = "grey25"
) +
geom_vline(xintercept = 140, linetype = 2, linewidth = 0.7, color = "firebrick") +
scale_fill_manual(
values = c("#fee2e2","#fecaca","#fca5a5"),
guide  = "none"
) +
labs(
title    = paste0("SBP distributions by Diet quality", title_suffix),
subtitle = "Dashed line at 140 mmHg",
x        = "Systolic blood pressure (mmHg)",
y        = NULL
) +
theme_ridges(font_size = 12, grid = TRUE)
}

build_a1c_plot <- function(dat, title_suffix = "") {
ggplot(dat, aes(x = a1c, y = diet_ter, fill = diet_ter)) +
ggridges::geom_density_ridges(
scale = 1.2, rel_min_height = 0.01,
alpha = 0.90, color = "grey25"
) +
geom_vline(xintercept = 7, linetype = 2, linewidth = 0.7, color = "firebrick") +
scale_fill_manual(
values = c("#dcfce7","#bbf7d0","#86efac"),
guide  = "none"
) +
labs(
title    = paste0("A1c distributions by Diet quality", title_suffix),
subtitle = "Dashed line at 7%",
x        = "Hemoglobin A1c (%)",
y        = NULL
) +
theme_ridges(font_size = 12, grid = TRUE)
}

# For a given BLI quintile, build a 2-panel ridge figure:

# left  = SBP × diet tertile

# right = A1c × diet tertile

make_bli_panel <- function(bli_label) {
sbp_sub <- ridg_dat %>%
filter(bli_q == bli_label, !is.na(diet_ter), !is.na(sbp))
a1c_sub <- ridg_dat %>%
filter(bli_q == bli_label, !is.na(diet_ter), !is.na(a1c))

left  <- if (nrow(sbp_sub)) {
build_sbp_plot(
sbp_sub,
paste0(" (", bli_label, "; n=", nrow(sbp_sub), ")")
)
} else NULL

right <- if (nrow(a1c_sub)) {
build_a1c_plot(
a1c_sub,
paste0(" (", bli_label, "; n=", nrow(a1c_sub), ")")
)
} else NULL

fig <- if (!is.null(left) && !is.null(right)) {
left | right
} else if (!is.null(left)) {
left
} else if (!is.null(right)) {
right
} else {
ggplot() + theme_void() +
ggtitle(paste0(bli_label, " — no data"))
}

fig + plot_annotation(
title = paste0("Diet × Clinical distributions — ", bli_label),
theme = theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
)
)
}

# Precompute a panel for each BLI quintile (can print selectively)

panel_bli_q1 <- make_bli_panel("BLI Q1 (lowest)")
panel_bli_q2 <- make_bli_panel("Q2")
panel_bli_q3 <- make_bli_panel("Q3")
panel_bli_q4 <- make_bli_panel("Q4")
panel_bli_q5 <- make_bli_panel("BLI Q5 (highest)")

print(panel_bli_q1)
## Picking joint bandwidth of 4.39
## Picking joint bandwidth of 0.114

I also mirror this conditioning for PIR tertiles, creating SBP/A1c ridge plots by diet quality within income bands.

to_num <- function(x) suppressWarnings(as.numeric(x))

if (!("diet_ter" %in% names(df1))) {
  stop("diet_ter is missing; build it from diet_score before running.")
}
df1$diet_ter <- fct_relevel(df1$diet_ter,
                            "Low diet score","Mid diet score","High diet score")

if (!("pir3" %in% names(df1)) || all(is.na(df1$pir3))) {
  if (!("pir" %in% names(df1))) stop("Need numeric PIR to build pir3.")
  qs_pir <- quantile(df1$pir, c(0, 1/3, 2/3, 1), na.rm = TRUE)
  df1$pir3 <- cut(df1$pir, qs_pir, include.lowest = TRUE,
                  labels = c("Low PIR","Mid PIR","High PIR"))
}
df1$pir3 <- fct_relevel(df1$pir3, "Low PIR","Mid PIR","High PIR")

ridg_income <- df1 %>%
  transmute(
    sbp   = to_num(sbp),
    a1c   = to_num(a1c),
    pir3  = pir3,
    diet_ter = diet_ter
  ) %>%
  mutate(
    sbp = ifelse(sbp >= 80 & sbp <= 200, sbp, NA_real_),
    a1c = ifelse(a1c >= 4  & a1c <= 14,  a1c, NA_real_)
  )

build_sbp_plot <- function(dat, title_suffix="") {
  ggplot(dat, aes(x=sbp, y=diet_ter, fill=diet_ter)) +
    ggridges::geom_density_ridges(scale=1.2, rel_min_height=0.01,
                                  alpha=0.90, color="grey25") +
    geom_vline(xintercept=140, linetype=2, linewidth=0.7, color="firebrick") +
    scale_fill_manual(values=c("#fee2e2","#fecaca","#fca5a5"), guide="none") +
    labs(title=paste0("SBP distributions by Diet quality", title_suffix),
         subtitle="Dashed line at 140 mmHg",
         x="Systolic blood pressure (mmHg)", y=NULL) +
    theme_ridges(font_size=12, grid=TRUE)
}
build_a1c_plot <- function(dat, title_suffix="") {
  ggplot(dat, aes(x=a1c, y=diet_ter, fill=diet_ter)) +
    ggridges::geom_density_ridges(scale=1.2, rel_min_height=0.01,
                                  alpha=0.90, color="grey25") +
    geom_vline(xintercept=7, linetype=2, linewidth=0.7, color="firebrick") +
    scale_fill_manual(values=c("#dcfce7","#bbf7d0","#86efac"), guide="none") +
    labs(title=paste0("A1c distributions by Diet quality", title_suffix),
         subtitle="Dashed line at 7%",
         x="Hemoglobin A1c (%)", y=NULL) +
    theme_ridges(font_size=12, grid=TRUE)
}

make_pir_panel <- function(pir_label) {
  sbp_sub <- ridg_income %>% filter(pir3 == pir_label, !is.na(diet_ter), !is.na(sbp))
  a1c_sub <- ridg_income %>% filter(pir3 == pir_label, !is.na(diet_ter), !is.na(a1c))

  left  <- if (nrow(sbp_sub))
    build_sbp_plot(sbp_sub, paste0(" (", pir_label, "; n=", nrow(sbp_sub), ")")) else NULL
  right <- if (nrow(a1c_sub))
    build_a1c_plot(a1c_sub, paste0(" (", pir_label, "; n=", nrow(a1c_sub), ")")) else NULL

  fig <- if (!is.null(left) && !is.null(right)) left | right
         else if (!is.null(left)) left
         else if (!is.null(right)) right
         else ggplot() + theme_void() + ggtitle(paste0(pir_label, " — no data"))

  fig + plot_annotation(
    title = paste0("Diet × Clinical distributions — ", pir_label),
    theme = theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
  )
}

panel_pir_low  <- make_pir_panel("Low PIR")
panel_pir_mid  <- make_pir_panel("Mid PIR")
panel_pir_high <- make_pir_panel("High PIR")


print(panel_pir_low)
## Picking joint bandwidth of 4.47
## Picking joint bandwidth of 0.133

print(panel_pir_mid)
## Picking joint bandwidth of 4.36
## Picking joint bandwidth of 0.122

print(panel_pir_high)
## Picking joint bandwidth of 3.89
## Picking joint bandwidth of 0.1

Numerical analysis: To make these visual impressions more concrete, I pair each set of ridges with simple numeric summaries: group-wise means and standard deviations, plus the percent of people above clinically relevant cutoffs (A1c ≥ 6% and ≥ 7%; SBP ≥ 140 or DBP ≥ 90). These tables give us a way to quantify whether, for example, “leaner” ridges in high-diet groups really correspond to meaningfully lower risk, or whether the distributions mostly overlap.

# Helper: summarise A1c within arbitrary grouping structure
summ_a1c <- function(dat, group_vars,
                     cut_lo = 6, cut_hi = 7) {
  dat %>%
    dplyr::filter(!is.na(a1c)) %>%
    dplyr::group_by(dplyr::across(all_of(group_vars))) %>%
    dplyr::summarise(
      n        = dplyr::n(),
      mean_a1c = mean(a1c),
      sd_a1c   = sd(a1c),
      pct_ge_lo = 100 * mean(a1c >= cut_lo),
      pct_ge_hi = 100 * mean(a1c >= cut_hi),
      .groups  = "drop"
    )
}

# Helper: summarise BP within arbitrary grouping structure
summ_bp <- function(dat, group_vars,
                    sbp_cut = 140, dbp_cut = 90) {
  dat %>%
    dplyr::filter(!is.na(sbp), !is.na(dbp)) %>%
    dplyr::group_by(dplyr::across(all_of(group_vars))) %>%
    dplyr::summarise(
      n          = dplyr::n(),
      mean_sbp   = mean(sbp),
      sd_sbp     = sd(sbp),
      mean_dbp   = mean(dbp),
      sd_dbp     = sd(dbp),
      pct_unctrl = 100 * mean(sbp >= sbp_cut | dbp >= dbp_cut),
      .groups    = "drop"
    )
}

# ---- Match the ridge-plot facets ----
# (Adjust group_vars to align with the stratification you used in Panel C.)

# Example 1: Diet × Income (PIR) ridges
a1c_pir_diet  <- summ_a1c(df1, c("pir_grp", "diet_ter"))
bp_pir_diet   <- summ_bp(df1,  c("pir_grp", "diet_ter"))

# Example 2: Diet × Age ridges
a1c_age_diet  <- summ_a1c(df1, c("age3", "diet_ter"))
bp_age_diet   <- summ_bp(df1,  c("age3", "diet_ter"))

# If you have ridges stratified by behavioral load index (BLI), you can also do:
# a1c_bli_diet <- summ_a1c(df1, c("bli_quint", "diet_ter"))
# bp_bli_diet  <- summ_bp(df1,  c("bli_quint", "diet_ter"))

# Optionally, print as compact tables in the notebook
knitr::kable(a1c_pir_diet, digits = 1,
             caption = "A1c distribution by PIR × diet tertile")
A1c distribution by PIR × diet tertile
pir_grp diet_ter n mean_a1c sd_a1c pct_ge_lo pct_ge_hi
Low PIR Low diet score 463 5.9 1.4 25.3 11.4
Low PIR Mid diet score 423 5.8 1.3 22.2 6.9
Low PIR High diet score 416 5.9 1.2 26.4 11.3
Low PIR NA 458 5.7 1.1 19.4 8.3
Mid PIR Low diet score 556 5.8 1.0 26.1 10.6
Mid PIR Mid diet score 516 5.7 1.1 19.6 6.0
Mid PIR High diet score 485 5.7 0.9 20.2 6.2
Mid PIR NA 416 5.6 1.1 18.0 5.5
High PIR Low diet score 530 5.7 0.9 20.0 6.0
High PIR Mid diet score 563 5.6 0.9 13.5 4.6
High PIR High diet score 705 5.5 0.7 12.9 3.1
High PIR NA 342 5.6 1.0 16.4 7.9
NA Low diet score 189 5.7 0.8 19.0 6.9
NA Mid diet score 190 5.8 1.0 22.1 7.9
NA High diet score 212 5.8 1.1 22.6 8.0
NA NA 251 5.8 1.2 21.5 8.4
knitr::kable(bp_pir_diet,  digits = 1,
             caption = "BP distribution by PIR × diet tertile")
BP distribution by PIR × diet tertile
pir_grp diet_ter n mean_sbp sd_sbp mean_dbp sd_dbp pct_unctrl
Low PIR Low diet score 535 119.1 18.8 72.8 12.2 15.5
Low PIR Mid diet score 488 117.8 19.8 71.5 12.6 16.0
Low PIR High diet score 459 120.2 20.5 73.1 12.1 20.3
Low PIR NA 555 116.2 19.4 70.6 12.4 12.1
Mid PIR Low diet score 627 120.4 18.0 72.6 11.2 17.5
Mid PIR Mid diet score 586 119.5 18.5 72.0 11.9 16.0
Mid PIR High diet score 530 120.3 18.4 72.2 11.1 15.1
Mid PIR NA 475 118.0 17.4 71.9 11.7 12.6
High PIR Low diet score 576 119.7 16.2 73.7 10.9 14.1
High PIR Mid diet score 618 119.5 16.4 73.4 10.7 12.8
High PIR High diet score 740 119.7 16.9 72.1 10.0 12.3
High PIR NA 381 117.7 16.9 71.8 11.7 12.9
NA Low diet score 215 119.1 18.9 71.5 11.5 15.8
NA Mid diet score 217 118.8 18.5 71.6 10.5 17.1
NA High diet score 229 119.9 17.6 71.8 10.7 15.3
NA NA 287 119.0 19.1 70.6 11.8 13.6
knitr::kable(a1c_age_diet, digits = 1,
             caption = "A1c distribution by age band × diet tertile")
A1c distribution by age band × diet tertile
age3 diet_ter n mean_a1c sd_a1c pct_ge_lo pct_ge_hi
Younger (18–34) Low diet score 299 5.3 0.6 3.3 2.0
Younger (18–34) Mid diet score 330 5.3 1.0 3.3 1.8
Younger (18–34) High diet score 284 5.2 0.3 2.1 0.4
Younger (18–34) NA 370 5.3 0.7 5.1 1.4
Middle (35–64) Low diet score 727 5.9 1.3 26.0 10.3
Middle (35–64) Mid diet score 697 5.8 1.2 21.7 7.3
Middle (35–64) High diet score 793 5.7 1.1 19.7 8.7
Middle (35–64) NA 594 5.8 1.2 22.6 8.4
Older (65+) Low diet score 536 6.1 1.0 38.2 14.2
Older (65+) Mid diet score 489 6.0 1.1 30.5 9.0
Older (65+) High diet score 604 5.9 0.9 30.3 7.6
Older (65+) NA 279 6.2 1.2 43.4 19.4
NA Low diet score 176 5.3 0.3 0.0 0.0
NA Mid diet score 176 5.2 0.3 1.1 0.0
NA High diet score 137 5.2 0.3 1.5 0.0
NA NA 224 5.2 0.3 0.0 0.0
knitr::kable(bp_age_diet,  digits = 1,
             caption = "BP distribution by age band × diet tertile")
BP distribution by age band × diet tertile
age3 diet_ter n mean_sbp sd_sbp mean_dbp sd_dbp pct_unctrl
Younger (18–34) Low diet score 323 112.6 12.0 72.0 10.0 4.6
Younger (18–34) Mid diet score 346 113.3 12.2 72.1 10.7 6.6
Younger (18–34) High diet score 291 110.8 11.7 70.4 9.9 4.1
Younger (18–34) NA 384 112.8 11.7 70.7 9.5 4.7
Middle (35–64) Low diet score 738 122.2 15.7 77.9 10.7 17.2
Middle (35–64) Mid diet score 702 121.8 17.0 77.6 10.9 16.5
Middle (35–64) High diet score 806 120.8 16.8 75.7 10.5 14.6
Middle (35–64) NA 595 122.7 16.5 77.9 10.9 17.3
Older (65+) Low diet score 547 129.5 19.8 72.4 10.9 29.8
Older (65+) Mid diet score 502 130.0 19.2 72.5 10.5 29.7
Older (65+) High diet score 614 129.5 19.1 72.9 10.6 27.2
Older (65+) NA 275 133.5 22.6 71.9 13.3 34.2
NA Low diet score 345 105.3 9.4 63.5 8.3 0.9
NA Mid diet score 359 103.6 9.4 61.7 7.3 0.0
NA High diet score 247 104.7 9.6 62.5 7.4 0.8
NA NA 444 104.6 10.0 62.4 7.8 0.0

Summary: Across all ridge plots, the big shifts in risk come from income and age, not diet tertiles. Within each PIR band, A1c and BP means are almost identical across low/mid/high diet, and the share above A1c ≥7% or uncontrolled BP changes by only a few percentage points, confirming that diet tiles in Panel A were mostly showing differential detection rather than large differences in underlying risk. By contrast, moving from younger → middle-aged → older adults shifts the whole A1c and SBP distributions rightward and roughly quintuples the proportion above clinical cutpoints, with only modest leftward nudges for high-diet groups in middle-aged and older adults. Taken together, Panel C suggests that in this cross-section, diet quality is at best a secondary modifier of cardiometabolic risk distributions once we account for age and income.

Panel D Pathway plots: Diagnosis → Treatment → Control of Diabetic Conditions

Up to this point, Panels A–C have told us two things: (1) income, age, and behavior load strongly shape who shows up with elevated A1c or BP, while diet tertiles mainly add small, non-monotone tweaks; and (2) many of the strongest gradients we saw were in access and testing, not necessarily in the underlying risk distributions themselves.

In Panel D we therefore shift from static prevalence to the care pathway: among adults with hypertension or diabetes, where exactly do disparities appear—at diagnosis, treatment uptake, or successful control? We stratify by income and diet (and then by age and diet) to ask whether “better” diet tertiles are associated with fewer people being undiagnosed, greater medication use once diagnosed, or higher rates of control conditional on diagnosis. This helps us see whether diet quality in this cross-section acts more like a marker of health-system engagement (better screening and treatment) or a modifier of biological response once on therapy, and how those patterns differ across income and age bands.

To understand the care pathway I summarize, for diabetes especially:

  1. Among all adults: percent diagnosed.
  2. Among those diagnosed: percent on medication.
  3. Among those diagnosed: percent controlled.

I first set up common helpers and derive consistent diagnosis, medication, and control flags, which will be reused for both income‐ and age‐stratified plots.

yn2   <- function(x) ifelse(as.numeric(as.character(x)) %in% 1, 1L,
ifelse(as.numeric(as.character(x)) %in% 2, 0L, NA_integer_))
to_num <- function(x) suppressWarnings(as.numeric(as.character(x)))

if (!("pir_grp" %in% names(df1)) || all(is.na(df1$pir_grp))) {
qs_pir <- quantile(df1$pir, c(0, 1/3, 2/3, 1), na.rm = TRUE)
df1$pir_grp <- cut(df1$pir, qs_pir, include.lowest = TRUE,
labels = c("Low PIR","Mid PIR","High PIR"))
}
df1$pir_grp <- forcats::fct_relevel(df1$pir_grp, "Low PIR","Mid PIR","High PIR")

stopifnot("diet_score" %in% names(df1))
if (!("diet_ter" %in% names(df1))) {
df1 <- df1 %>%
dplyr::mutate(
diet_ter = dplyr::ntile(diet_score, 3),
diet_ter = factor(diet_ter,
labels = c("Low diet score","Mid diet score","High diet score"))
)
} else {
df1$diet_ter <- forcats::fct_relevel(df1$diet_ter,
"Low diet score","Mid diet score","High diet score")
}

df1 <- df1 %>%
dplyr::mutate(
dx_htn = if ("BPQ020" %in% names(.)) yn2(BPQ020) else NA_integer_,
dx_dm  = dplyr::case_when(
"DIQ010" %in% names(.) ~ ifelse(DIQ010 %in% 1, 1L,
ifelse(DIQ010 %in% c(2,3), 0L, NA_integer_)),
TRUE ~ NA_integer_
),
htn_meds = if ("BPQ050A" %in% names(.)) yn2(BPQ050A) else NA_integer_,
dm_ins   = if ("DIQ050"  %in% names(.)) yn2(DIQ050)  else NA_integer_,
dm_pill  = if ("DIQ070"  %in% names(.)) yn2(DIQ070)  else NA_integer_,
dm_meds  = as.integer(dplyr::coalesce(dm_ins,0L) | dplyr::coalesce(dm_pill,0L)),

sbp = to_num(sbp), dbp = to_num(dbp), a1c = to_num(a1c),
htn_ctrl = ifelse(dx_htn == 1L & !is.na(sbp) & !is.na(dbp),
                  as.integer(sbp < 140 & dbp < 90), NA_integer_),
dm_ctrl  = ifelse(dx_dm  == 1L & !is.na(a1c),
                  as.integer(a1c < 7), NA_integer_)

)

df_use <- df1 %>% dplyr::filter(!is.na(pir_grp))

Next, I define a generic pathway summarizer that computes, for each PIR × diet stratum:

how many adults have usable data (n_any),

how many are diagnosed, on meds, and controlled,

and the corresponding percentages for each stage of the pathway.

summ_pathway <- function(dat, x_group = "diet_ter",
dx_var, meds_var, ctrl_var, outcome_label) {
xsym <- rlang::sym(x_group)

base_any <- dat %>%
dplyr::group_by(pir_grp, !!xsym) %>%
dplyr::summarise(
n_any   = sum(!is.na(.data[[dx_var]])),
n_total = dplyr::n(),
n_dx_any= sum(.data[[dx_var]] == 1L, na.rm = TRUE),
.groups = "drop"
)

base_dx <- dat %>%
dplyr::filter(.data[[dx_var]] == 1L) %>%
dplyr::group_by(pir_grp, !!xsym) %>%
dplyr::summarise(
n_dx   = dplyr::n(),
n_meds = sum(dplyr::coalesce(.data[[meds_var]],0L), na.rm = TRUE),
n_ctrl = sum(dplyr::coalesce(.data[[ctrl_var]],0L), na.rm = TRUE),
.groups = "drop"
)

joined <- base_any %>%
dplyr::left_join(base_dx, by = c("pir_grp", x_group)) %>%
dplyr::mutate(
n_dx   = dplyr::coalesce(n_dx,   0L),
n_meds = dplyr::coalesce(n_meds, 0L),
n_ctrl = dplyr::coalesce(n_ctrl, 0L),
pct_dx   = 100 * (n_dx_any / pmax(n_any, 1)),
pct_meds = 100 * (n_meds   / pmax(n_dx,  1)),
pct_ctrl = 100 * (n_ctrl   / pmax(n_dx,  1))
)

long <- joined %>%
tidyr::pivot_longer(
cols = c(pct_ctrl, pct_meds, pct_dx),
names_to = "stage", values_to = "pct"
) %>%
dplyr::mutate(
stage = factor(stage,
levels = c("pct_ctrl","pct_meds","pct_dx"),
labels = c("Controlled (among diagnosed)",
"On meds (among diagnosed)",
"Diagnosed (among all)")),
denom = dplyr::case_when(
stage == "Diagnosed (among all)"        ~ n_any,
stage == "On meds (among diagnosed)"    ~ n_dx,
stage == "Controlled (among diagnosed)" ~ n_dx
),
outcome = outcome_label
) %>%
dplyr::select(pir_grp, !!xsym, stage, pct, denom, outcome)

long
}

I then apply this pathway summary separately for hypertension and diabetes, and bind them for plotting. The first figure shows the 3-stage pathway by PIR and diet tertile; the second zooms into diabetes control by diet × PIR, split by medication status and making denominators explicit.

htn_path_diet <- summ_pathway(df_use, "diet_ter", "dx_htn","htn_meds","htn_ctrl","Hypertension")
dm_path_diet  <- summ_pathway(df_use, "diet_ter", "dx_dm", "dm_meds","dm_ctrl","Diabetes")
path_diet_all <- dplyr::bind_rows(htn_path_diet, dm_path_diet)

fig_path_diet <- ggplot2::ggplot(path_diet_all,
ggplot2::aes(x = stage, y = pct, fill = diet_ter)) +
ggplot2::geom_col(position = ggplot2::position_dodge(width = 0.75), width = 0.7) +
ggplot2::geom_text(
ggplot2::aes(label = sprintf("%.0f%%\n(n=%s)", pct, scales::comma(denom))),
position = ggplot2::position_dodge(width = 0.75),
vjust = -0.35, lineheight = 0.95, size = 3
) +
ggplot2::facet_grid(outcome ~ pir_grp, scales = "free_y") +
ggplot2::scale_fill_brewer(palette = "Blues", direction = -1, name = "Diet quality") +
ggplot2::labs(
title = "Diagnosis → Treatment → Control pathways by income and diet quality",
x = NULL, y = "% of adults"
) +
ggplot2::theme_minimal(base_size = 12) +
ggplot2::theme(
panel.grid.major.x = ggplot2::element_blank(),
axis.text.x = ggplot2::element_text(angle = 10, hjust = 1),
plot.title = ggplot2::element_text(face = "bold")
)

dm_ctrl_by_diet_pir <- df_use %>%
dplyr::filter(dx_dm == 1L) %>%
dplyr::mutate(med_group = ifelse(dm_meds == 1L, "Medicated", "Unmedicated")) %>%
dplyr::group_by(pir_grp, diet_ter, med_group) %>%
dplyr::summarise(
n_dx   = dplyr::n(),
n_ctrl = sum(dplyr::coalesce(dm_ctrl,0L), na.rm = TRUE),
pct_ctrl = 100 * n_ctrl / pmax(n_dx, 1),
.groups = "drop"
)

fig_dm_ctrl_diet <- ggplot2::ggplot(dm_ctrl_by_diet_pir,
ggplot2::aes(x = diet_ter, y = pct_ctrl, fill = med_group)) +
ggplot2::geom_col(position = ggplot2::position_dodge(width = 0.75), width = 0.65) +
ggplot2::geom_text(
ggplot2::aes(label = sprintf("%.0f%%\n(n=%s)", pct_ctrl, scales::comma(n_dx))),
position = ggplot2::position_dodge(width = 0.75),
vjust = -0.35, lineheight = 0.95, size = 3
) +
ggplot2::scale_fill_brewer(palette = "Blues", direction = -1, name = NULL) +
ggplot2::facet_wrap(~ pir_grp, nrow = 1) +
ggplot2::labs(
title = "Diabetes control (A1c < 7%) among diagnosed by Diet × Income, split by medication status",
x = "Diet quality (tertiles)", y = "% controlled"
) +
ggplot2::theme_minimal(base_size = 12) +
ggplot2::theme(
panel.grid.major.x = ggplot2::element_blank(),
axis.text.x = ggplot2::element_text(angle = 10, hjust = 1),
plot.title = ggplot2::element_text(face = "bold")
)

fig_path_diet

fig_dm_ctrl_diet

Pathway plots by Age × Diet

Analogously, I build diagnosis–treatment–control pathways by age group and diet tertile, and then look specifically at diabetes control by diet and age, split by medication status.

to_num <- function(x) suppressWarnings(as.numeric(as.character(x)))
yn2    <- function(x) {
z <- to_num(x)
ifelse(z %in% 1, 1L, ifelse(z %in% 2, 0L, NA_integer_))
}

if (!("age3" %in% names(df1))) {
df1 <- df1 %>%
mutate(
RIDAGEYR = if ("RIDAGEYR" %in% names(.)) to_num(RIDAGEYR) else to_num(age),
age3 = case_when(
!is.na(RIDAGEYR) & RIDAGEYR >= 18  & RIDAGEYR < 35 ~ "Younger (18–34)",
!is.na(RIDAGEYR) & RIDAGEYR >= 35  & RIDAGEYR < 65 ~ "Middle (35–64)",
!is.na(RIDAGEYR) & RIDAGEYR >= 65                   ~ "Older (65+)"
),
age3 = factor(age3, levels = c("Younger (18–34)","Middle (35–64)","Older (65+)"))
)
}

if (!("diet_ter" %in% names(df1))) {
stop("diet_ter is missing. Create tertiles from diet_score before proceeding.")
} else {
df1$diet_ter <- forcats::fct_relevel(df1$diet_ter,
"Low diet score","Mid diet score","High diet score")
}

df1 <- df1 %>%
mutate(
sbp = if ("sbp" %in% names(.)) to_num(sbp) else NA_real_,
dbp = if ("dbp" %in% names(.)) to_num(dbp) else NA_real_,
a1c = if ("a1c" %in% names(.)) to_num(a1c) else NA_real_,

dx_htn   = if ("dx_htn"   %in% names(.)) as.integer(dx_htn)   else NA_integer_,
htn_meds = if ("htn_meds" %in% names(.)) as.integer(htn_meds) else NA_integer_,
htn_ctrl = if ("htn_ctrl" %in% names(.)) as.integer(htn_ctrl) else NA_integer_,

dx_dm    = if ("dx_dm"    %in% names(.)) as.integer(dx_dm)    else NA_integer_,
dm_meds  = if ("dm_meds"  %in% names(.)) as.integer(dm_meds)  else NA_integer_,
dm_ctrl  = if ("dm_ctrl"  %in% names(.)) as.integer(dm_ctrl)  else NA_integer_


)

df_age <- df1 %>%
filter(!is.na(age3), !is.na(diet_ter)) %>%
select(age3, diet_ter,
dx_htn, htn_meds, htn_ctrl,
dx_dm,  dm_meds,  dm_ctrl,
sbp, dbp, a1c)

I reuse the same pathway logic, now grouped by age3 × diet_ter. The function structure mirrors the PIR-based version, but with age as the main stratifier.

summ_pathway <- function(dat, dx_var, meds_var, ctrl_var, outcome_label){
den_any <- dat %>%
dplyr::group_by(age3, diet_ter) %>%
dplyr::summarise(n_any = sum(!is.na(.data[[dx_var]])), .groups="drop")

den_dx <- dat %>%
dplyr::filter(.data[[dx_var]] == 1L) %>%
dplyr::group_by(age3, diet_ter) %>%
dplyr::summarise(
n_dx   = dplyr::n(),
n_meds = sum(coalesce(.data[[meds_var]],0L), na.rm = TRUE),
n_ctrl = sum(coalesce(.data[[ctrl_var]],0L), na.rm = TRUE),
.groups = "drop"
)

out <- den_any %>%
dplyr::left_join(
dat %>% dplyr::group_by(age3, diet_ter) %>%
dplyr::summarise(n_dx_any = sum(.data[[dx_var]]==1L, na.rm=TRUE), .groups="drop"),
by = c("age3","diet_ter")
) %>%
dplyr::left_join(den_dx, by = c("age3","diet_ter")) %>%
dplyr::mutate(
pct_dx   = 100 * (n_dx_any / pmax(n_any, 1)),
pct_meds = 100 * (n_meds   / pmax(n_dx,  1)),
pct_ctrl = 100 * (n_ctrl   / pmax(n_dx,  1))
) %>%
dplyr::select(age3, diet_ter, n_any, n_dx, pct_dx, pct_meds, pct_ctrl) %>%
tidyr::pivot_longer(cols = starts_with("pct_"),
names_to = "stage", values_to = "pct") %>%
dplyr::mutate(
stage = factor(stage,
levels=c("pct_ctrl","pct_meds","pct_dx"),
labels=c("Controlled (among diagnosed)",
"On meds (among diagnosed)",
"Diagnosed (among all)")),
denom = dplyr::case_when(
stage == "Controlled (among diagnosed)" ~ n_dx,
stage == "On meds (among diagnosed)"    ~ n_dx,
stage == "Diagnosed (among all)"        ~ n_any
),
outcome = outcome_label
) %>%
dplyr::select(age3, diet_ter, stage, pct, denom, outcome)

out
}

Finally, I generate the age × diet pathway figure and the diabetes control figure, again annotating bars with both percentages and denominators to make interpretation transparent.

path_htn_age <- summ_pathway(df_age, "dx_htn", "htn_meds", "htn_ctrl", "Hypertension")
path_dm_age  <- summ_pathway(df_age, "dx_dm",  "dm_meds",  "dm_ctrl",  "Diabetes")
path_all_age <- dplyr::bind_rows(path_htn_age, path_dm_age)

diet_pal <- c("Low diet score"="#CBD5E1","Mid diet score"="#94A3B8","High diet score"="#475569")

p_path_diet_age <- ggplot(path_all_age,
aes(x = stage, y = pct, fill = diet_ter)) +
geom_col(position = position_dodge(width = 0.75), width = 0.7) +
geom_text(aes(label = sprintf("%.0f%%\n(n=%s)", pct, scales::comma(denom))),
position = position_dodge(width = 0.75),
vjust = -0.35, lineheight = 0.95, size = 3) +
facet_grid(outcome ~ age3, scales = "free_y") +
scale_fill_manual(values = diet_pal, name = "Diet quality (tertiles)") +
labs(title = "Diagnosis → Treatment → Control pathways by Age and Diet quality",
x = NULL, y = "% of adults") +
theme_minimal(base_size = 12) +
theme(panel.grid.major.x = element_blank(),
axis.text.x = element_text(angle = 10, hjust = 1),
plot.title = element_text(face = "bold"))

dm_ctrl_age <- df_age %>%
dplyr::filter(dx_dm == 1L) %>%
dplyr::mutate(med_group = ifelse(dm_meds == 1L, "Medicated", "Unmedicated")) %>%
dplyr::group_by(age3, diet_ter, med_group) %>%
dplyr::summarise(
n_dx = dplyr::n(),
n_ctrl = sum(coalesce(dm_ctrl,0L), na.rm = TRUE),
pct_ctrl = 100 * n_ctrl / pmax(n_dx, 1),
.groups = "drop"
)

p_dm_ctrl_diet_age <- ggplot(dm_ctrl_age,
aes(x = diet_ter, y = pct_ctrl, fill = med_group)) +
geom_col(position = position_dodge(width = 0.75), width = 0.65) +
geom_text(aes(label = sprintf("%.0f%%\n(n=%s)", pct_ctrl, scales::comma(n_dx))),
position = position_dodge(width = 0.75),
vjust = -0.35, lineheight = 0.95, size = 3) +
scale_fill_brewer(palette = "Blues", direction = -1, name = NULL) +
facet_wrap(~ age3, nrow = 1) +
labs(title = "Diabetes control (A1c < 7%) among diagnosed by Diet quality × Age group",
x = "Diet quality (tertiles)", y = "% controlled") +
theme_minimal(base_size = 12) +
theme(panel.grid.major.x = element_blank(),
axis.text.x = element_text(angle = 10, hjust = 1),
plot.title = element_text(face = "bold"))

p_path_diet_age

p_dm_ctrl_diet_age

For summaries, I collapse the pathway plots into tables by group (PIR×diet or age×diet), computing—for each stage (diagnosed, on meds, controlled)—the percentage at each diet tertile and then taking simple high-minus-low diet differences, plus (for diabetes) the medicated vs unmedicated control gap within each stratum.

# 1) Hypertension & diabetes pathways by PIR × diet: High vs Low diet
path_pir_diet_diff <- path_diet_all %>%
  filter(outcome %in% c("Hypertension","Diabetes"),
         diet_ter %in% c("Low diet score","High diet score")) %>%
  select(pir_grp, outcome, diet_ter, stage, pct) %>%
  pivot_wider(names_from = diet_ter, values_from = pct) %>%
  mutate(diff_hi_minus_lo = `High diet score` - `Low diet score`) %>%
  arrange(outcome, pir_grp, stage)

path_pir_diet_diff
## # A tibble: 18 × 6
##    pir_grp  outcome    stage `Low diet score` `High diet score` diff_hi_minus_lo
##    <fct>    <chr>      <fct>            <dbl>             <dbl>            <dbl>
##  1 Low PIR  Diabetes   Cont…             33.3             42.9             9.52 
##  2 Low PIR  Diabetes   On m…             84.9             89.6             4.66 
##  3 Low PIR  Diabetes   Diag…             15.1             13.7            -1.42 
##  4 Mid PIR  Diabetes   Cont…             44.8             45.5             0.663
##  5 Mid PIR  Diabetes   On m…             87.5             78.2            -9.32 
##  6 Mid PIR  Diabetes   Diag…             13.5              9.35           -4.17 
##  7 High PIR Diabetes   Cont…             58.6             54.5            -4.03 
##  8 High PIR Diabetes   On m…             85.7             84.1            -1.62 
##  9 High PIR Diabetes   Diag…             11.4              5.42           -5.98 
## 10 Low PIR  Hypertens… Cont…             61.9             60.2            -1.66 
## 11 Low PIR  Hypertens… On m…              0                0               0    
## 12 Low PIR  Hypertens… Diag…             42.6             40.4            -2.18 
## 13 Mid PIR  Hypertens… Cont…             64.2             66.0             1.83 
## 14 Mid PIR  Hypertens… On m…              0                0               0    
## 15 Mid PIR  Hypertens… Diag…             42.2             34.1            -8.08 
## 16 High PIR Hypertens… Cont…             74.7             74.4            -0.359
## 17 High PIR Hypertens… On m…              0                0               0    
## 18 High PIR Hypertens… Diag…             36.0             28.4            -7.59
# 2) Diabetes control among diagnosed, by PIR × diet and med status
dm_ctrl_pir_diet_med <- dm_ctrl_by_diet_pir %>%
  select(pir_grp, diet_ter, med_group, pct_ctrl) %>%
  pivot_wider(names_from = med_group, values_from = pct_ctrl) %>%
  mutate(diff_med_minus_unmed = Medicated - Unmedicated) %>%
  arrange(pir_grp, diet_ter)

dm_ctrl_pir_diet_med
## # A tibble: 12 × 5
##    pir_grp  diet_ter        Medicated Unmedicated diff_med_minus_unmed
##    <fct>    <fct>               <dbl>       <dbl>                <dbl>
##  1 Low PIR  Low diet score       30.4        50                 -19.6 
##  2 Low PIR  Mid diet score       43.5        70.6               -27.1 
##  3 Low PIR  High diet score      39.1        75                 -35.9 
##  4 Low PIR  <NA>                 15.5        47.6               -32.2 
##  5 Mid PIR  Low diet score       39.3        83.3               -44.0 
##  6 Mid PIR  Mid diet score       49.2        75                 -25.8 
##  7 Mid PIR  High diet score      37.2        75                 -37.8 
##  8 Mid PIR  <NA>                 25.4        21.1                 4.32
##  9 High PIR Low diet score       56.7        70                 -13.3 
## 10 High PIR Mid diet score       38.5        85.7               -47.3 
## 11 High PIR High diet score      48.6        85.7               -37.1 
## 12 High PIR <NA>                 12          69.2               -57.2
## ---- panelD_numbers_age_diet --------------------------------------------
# 3) Hypertension & diabetes pathways by Age × diet: High vs Low diet
path_age_diet_diff <- path_all_age %>%
  filter(outcome %in% c("Hypertension","Diabetes"),
         diet_ter %in% c("Low diet score","High diet score")) %>%
  select(age3, outcome, diet_ter, stage, pct) %>%
  pivot_wider(names_from = diet_ter, values_from = pct) %>%
  mutate(diff_hi_minus_lo = `High diet score` - `Low diet score`) %>%
  arrange(outcome, age3, stage)

path_age_diet_diff
## # A tibble: 18 × 6
##    age3        outcome stage `Low diet score` `High diet score` diff_hi_minus_lo
##    <fct>       <chr>   <fct>            <dbl>             <dbl>            <dbl>
##  1 Younger (1… Diabet… Cont…            33.3              60             26.7   
##  2 Younger (1… Diabet… On m…            50                80             30     
##  3 Younger (1… Diabet… Diag…             1.80              1.68          -0.124 
##  4 Middle (35… Diabet… Cont…            47.1              31.1          -15.9   
##  5 Middle (35… Diabet… On m…            86.8              86.7           -0.0980
##  6 Middle (35… Diabet… Diag…            18.1              11.0           -7.10  
##  7 Older (65+) Diabet… Cont…            45.5              56.0           10.5   
##  8 Older (65+) Diabet… On m…            88.1              86.2           -1.87  
##  9 Older (65+) Diabet… Diag…            25.5              17.4           -8.15  
## 10 Younger (1… Hypert… Cont…            69.2              80             10.8   
## 11 Younger (1… Hypert… On m…             0                 0              0     
## 12 Younger (1… Hypert… Diag…             7.83              8.39           0.558 
## 13 Middle (35… Hypert… Cont…            70.0              70.5            0.572 
## 14 Middle (35… Hypert… On m…             0                 0              0     
## 15 Middle (35… Hypert… Diag…            40.3              29.5          -10.9   
## 16 Older (65+) Hypert… Cont…            63.8              64.1            0.253 
## 17 Older (65+) Hypert… On m…             0                 0              0     
## 18 Older (65+) Hypert… Diag…            63.2              53.7           -9.47
# 4) Diabetes control among diagnosed, by Age × diet and med status
dm_ctrl_age_diet_med <- dm_ctrl_age %>%
  select(age3, diet_ter, med_group, pct_ctrl) %>%
  pivot_wider(names_from = med_group, values_from = pct_ctrl) %>%
  mutate(diff_med_minus_unmed = Medicated - Unmedicated) %>%
  arrange(age3, diet_ter)

dm_ctrl_age_diet_med
## # A tibble: 9 × 5
##   age3            diet_ter        Medicated Unmedicated diff_med_minus_unmed
##   <fct>           <fct>               <dbl>       <dbl>                <dbl>
## 1 Younger (18–34) Low diet score        0          66.7                -66.7
## 2 Younger (18–34) Mid diet score       16.7        66.7                -50  
## 3 Younger (18–34) High diet score      50         100                  -50  
## 4 Middle (35–64)  Low diet score       42.4        77.8                -35.4
## 5 Middle (35–64)  Mid diet score       44.2        80                  -35.8
## 6 Middle (35–64)  High diet score      24.4        75                  -50.6
## 7 Older (65+)     Low diet score       42.9        64.7                -21.8
## 8 Older (65+)     Mid diet score       42.9        66.7                -23.8
## 9 Older (65+)     High diet score      52.1        80                  -27.9

Summary: Across PIR strata, people with high diet scores are consistently less likely to carry a diabetes diagnosis than low-diet peers at the same income level (≈4–6 percentage-point lower prevalence in mid and high PIR), while differences in control among those diagnosed are modest: in low PIR, high-diet diabetics are about 10 pp more likely to be controlled (43% vs 33%), but in mid and high PIR control is similar or slightly worse for the high-diet group. Age patterns echo this: diagnosis rates are much higher in middle and older adults, and within these bands high diet is associated with ~7–8 pp lower diabetes prevalence, whereas in young adults prevalence is ~2% regardless of diet but high-diet youth who are diagnosed show markedly higher medication use and control (≈60% vs 33% controlled, with small n). Finally, comparing medicated vs unmedicated patients within each diet × PIR or diet × age cell, control is always much lower among those on medication (often 20–60 pp), which almost certainly reflects confounding by indication (sicker patients are the ones treated) rather than a causal effect of medication. Taken together, Panel D supports the story that diet quality is more strongly linked to who ends up with diabetes in the first place than to how well diabetes is controlled once a person is in care, with income and age doing most of the work in shaping the control gradients.

Figure E. Ternary plots: diet composition vs A1c and SBP

Up to this point, we have treated “diet quality” as a single scalar score and asked how it interacts with income, age, and behavior load. That lens is useful for equity questions, but it hides what about diet might matter biologically: the balance between overall quality, total energy intake, and sodium load. Public health messaging often collapses these dimensions into a vague “good diet / bad diet” binary; here we explicitly separate them to see whether high-risk individuals cluster in particular corners of the diet–energy–sodium space.

To do this, I construct ternary scatterplots where each point represents an individual’s relative composition of diet quality, calories, and sodium:

Within each sex × age band (age3 × sex2), I convert the diet score, mean daily kcal, and mean daily sodium (averaged across the two 24-hr recalls) into percentile ranks,

Renormalize those three ranks so they sum to 1, giving ternary coordinates for diet, energy, and sodium share,

Then facet the ternary simplex by sex and age band.

In the A1c ternary plot, people with A1c ≥ 6% are colored by A1c and others are grey; in the SBP ternary plot, people in the top SBP decile (or flagged as uncontrolled) are colored by SBP and others are grey. These plots let us ask: among individuals at the highest metabolic or blood-pressure risk, do we see consistent shifts toward “high sodium / high energy” corners, or towards “high diet quality but still high energy/sodium,” beyond what a simple diet score captures?

to_num <- function(x) suppressWarnings(as.numeric(as.character(x)))
row_mean <- function(data, cols){
  if (!length(cols)) return(rep(NA_real_, nrow(data)))
  mats <- lapply(cols, function(cn) suppressWarnings(as.numeric(data[[cn]])))
  rowMeans(as.data.frame(mats), na.rm = TRUE)
}

if (!("age3" %in% names(df1))) {
  df1 <- df1 %>%
    mutate(
      RIDAGEYR = to_num(RIDAGEYR),
      age3 = case_when(
        !is.na(RIDAGEYR) & RIDAGEYR >= 18 & RIDAGEYR < 35 ~ "Younger (18–34)",
        !is.na(RIDAGEYR) & RIDAGEYR >= 35 & RIDAGEYR < 65 ~ "Middle (35–64)",
        !is.na(RIDAGEYR) & RIDAGEYR >= 65                ~ "Older (65+)"
      ),
      age3 = factor(age3, levels = c("Younger (18–34)","Middle (35–64)","Older (65+)"))
    )
}
df1 <- df1 %>%
  mutate(sex2 = case_when(
    RIAGENDR %in% 1 ~ "Male",
    RIAGENDR %in% 2 ~ "Female",
    TRUE ~ NA_character_
  )) %>%
  mutate(sex2 = factor(sex2, levels=c("Male","Female")))

kcal_cols <- grep("^DR[12].*KCAL$", names(df1), value = TRUE)
sod_cols  <- grep("^DR[12].*(SODI|SOD)$", names(df1), value = TRUE)

df1 <- df1 %>%
  mutate(
    kcal   = row_mean(cur_data(), kcal_cols),
    sodium = row_mean(cur_data(), sod_cols),
    a1c    = to_num(a1c)
  )

library(dplyr)
plot_dat <- df1 %>%
  filter(!is.na(sex2), !is.na(age3)) %>%
  filter(rowSums(is.na(select(., diet_score, kcal, sodium))) < 3) %>%
  group_by(sex2, age3) %>%
  mutate(
    r_diet = percent_rank(diet_score),
    r_kcal = percent_rank(kcal),
    r_sod  = percent_rank(sodium)
  ) %>%
  ungroup() %>%
  mutate(
    s = r_diet + r_kcal + r_sod,
    t_diet = ifelse(s > 0, r_diet / s, NA_real_),
    t_kcal = ifelse(s > 0, r_kcal / s, NA_real_),
    t_sod  = ifelse(s > 0, r_sod  / s, NA_real_)
  ) %>%
  filter(is.finite(t_diet), is.finite(t_kcal), is.finite(t_sod))

plot_low  <- plot_dat %>% filter(is.na(a1c) | a1c < 6)
plot_high <- plot_dat %>% filter(!is.na(a1c) & a1c >= 6)

p_tern_a1c <- ggtern() +
  geom_point(
    data = plot_low,
    aes(x = t_kcal, y = t_diet, z = t_sod),
    color = "grey80", size = 1.2, alpha = 0.55, na.rm = TRUE
  ) +
  geom_point(
    data = plot_high,
    aes(x = t_kcal, y = t_diet, z = t_sod, color = a1c),
    size = 1.2, alpha = 0.75, na.rm = TRUE
  ) +
  facet_grid(sex2 ~ age3) +
  Tlab("Energy rank") + Llab("Diet score rank") + Rlab("Sodium rank") +
  scale_color_viridis_c(option = "magma", direction = -1, name = "A1c (%)") +
  theme_bw(base_size = 11) +
  theme(
    legend.position = "right",
    strip.text = element_text(face = "bold")
  ) +
  labs(
    title = "Diet score vs Energy vs Sodium (rank-normalized)",
    subtitle = "Six subgroups (Male/Female × age strata). A1c < 6 shown in grey; ≥6 gets purple scale",
    caption = "Each axis is a within-subgroup percentile rank; components renormalized to sum to 1."
  )
## Warning in geom_point(data = plot_low, aes(x = t_kcal, y = t_diet, z = t_sod),
## : Ignoring unknown aesthetics: z
## Warning in geom_point(data = plot_high, aes(x = t_kcal, y = t_diet, z = t_sod,
## : Ignoring unknown aesthetics: z
p_tern_a1c 
## Ignoring unknown labels:
## • R : "Sodium rank"
## • Rarrow : "Sodium rank"
## • L : "Diet score rank"
## • Larrow : "Diet score rank"
## • T : "Energy rank"
## • Tarrow : "Energy rank"

12.2 SBP ternary plot: diet score vs energy vs sodium

For SBP, I reuse the same ternary coordinate system (diet score, energy, sodium ranks) but:

Rebuild plot_dat only if it does not already exist, to avoid recomputation.

Define “controlled” vs “uncontrolled” BP based on SBP/DBP ≥ 140/90.

Clip extreme SBP values for color scaling and plot uncontrolled individuals in color, with all others in grey.

if (!exists("plot_dat")) {
  library(dplyr)

  to_num   <- function(x) suppressWarnings(as.numeric(as.character(x)))
  rank01   <- function(x) ifelse(is.na(x), NA_real_, (rank(x, na.last="keep", ties.method="average")-0.5)/sum(!is.na(x)))
  renorm3  <- function(a,b,c){
    s <- a+b+c
    tibble(
      t_diet = ifelse(s>0, a/s, NA_real_),
      t_kcal = ifelse(s>0, b/s, NA_real_),
      t_sod  = ifelse(s>0, c/s, NA_real_)
    )
  }
  find_cols <- function(patterns) {
    cn <- names(df1)
    keep <- unlist(lapply(patterns, function(p) grep(p, cn, ignore.case = TRUE, value = TRUE)))
    unique(keep)
  }
  row_mean <- function(data, cols) {
    if (!length(cols)) return(rep(NA_real_, nrow(data)))
    m <- as.data.frame(lapply(cols, function(cn) suppressWarnings(as.numeric(data[[cn]]))))
    rowMeans(m, na.rm = TRUE)
  }

  kcal_cols <- find_cols(c("^DR1TKCAL$","^DR2TKCAL$"))
  sod_cols  <- find_cols(c("^DR1TSODI$","^DR2TSODI$"))
  stopifnot("diet_score" %in% names(df1))

  df_tmp <- df1 %>%
    mutate(
      kcal   = row_mean(cur_data(), kcal_cols),
      sodium = row_mean(cur_data(), sod_cols),
      sex2   = case_when(!!as.name("RIAGENDR") %in% 1 ~ "Male",
                         !!as.name("RIAGENDR") %in% 2 ~ "Female",
                         TRUE ~ "Unknown"),
      age3   = case_when(!is.na(RIDAGEYR) & RIDAGEYR >= 18 & RIDAGEYR < 35 ~ "Younger (18–34)",
                         !is.na(RIDAGEYR) & RIDAGEYR >= 35 & RIDAGEYR < 65 ~ "Middle (35–64)",
                         !is.na(RIDAGEYR) & RIDAGEYR >= 65                 ~ "Older (65+)",
                         TRUE ~ NA_character_),
      sex2   = factor(sex2,  levels=c("Male","Female")),
      age3   = factor(age3,  levels=c("Younger (18–34)","Middle (35–64)","Older (65+)"))
    ) %>%
    filter(!is.na(sex2), !is.na(age3))

  plot_dat <- df_tmp %>%
    group_by(sex2, age3) %>%
    mutate(
      r_diet = rank01(diet_score),
      r_kcal = rank01(kcal),
      r_sod  = rank01(sodium)
    ) %>%
    ungroup() %>%
    bind_cols(renorm3(.$r_diet, .$r_kcal, .$r_sod)) %>%
    mutate(
      sbp = to_num(sbp),
      dbp = to_num(dbp)
    )
}

bp_low  <- plot_dat %>%
  filter(is.na(sbp) | is.na(dbp) | (sbp < 140 & dbp < 90))

bp_high <- plot_dat %>%
  filter(!is.na(sbp) & !is.na(dbp) & (sbp >= 140 | dbp >= 90)) %>%
  mutate(sbp_clip = pmax(pmin(sbp, 200), 90))

p_tern_bp <- ggtern() +
  geom_point(
    data = bp_low,
    aes(x = t_kcal, y = t_diet, z = t_sod),
    color = "grey80", size = 1.2, alpha = 0.55, na.rm = TRUE
  ) +
  geom_point(
    data = bp_high,
    aes(x = t_kcal, y = t_diet, z = t_sod, color = sbp_clip),
    size = 1.2, alpha = 0.8, na.rm = TRUE
  ) +
  facet_grid(sex2 ~ age3) +
  Tlab("Energy rank") + Llab("Diet score rank") + Rlab("Sodium rank") +
  scale_color_viridis_c(option = "magma", direction = -1, name = "SBP (mmHg)") +
  theme_bw(base_size = 11) +
  theme(
    legend.position = "right",
    strip.text = element_text(face = "bold")
  ) +
  labs(
    title = "Diet score vs Energy vs Sodium (rank-normalized)",
    subtitle = "Six subgroups (Male/Female × age strata). Grey = SBP<140 & DBP<90; colored = uncontrolled",
    caption = "Each axis is a within-subgroup percentile rank; components renormalized to sum to 1."
  )
## Warning in geom_point(data = bp_low, aes(x = t_kcal, y = t_diet, z = t_sod), :
## Ignoring unknown aesthetics: z
## Warning in geom_point(data = bp_high, aes(x = t_kcal, y = t_diet, z = t_sod, :
## Ignoring unknown aesthetics: z
p_tern_bp
## Ignoring unknown labels:
## • R : "Sodium rank"
## • Rarrow : "Sodium rank"
## • L : "Diet score rank"
## • Larrow : "Diet score rank"
## • T : "Energy rank"
## • Tarrow : "Energy rank"

Numerical Analysis: For each outcome (A1c, SBP), I identify the top 5% of values and compare their average ternary composition (mean diet/energy/sodium ranks) with the remaining 90%, overall and optionally within age or income strata.

# --- Build ternary coords: diet score, kcal, sodium -----------------
diet_ternary <- df1 %>%
  mutate(
    # average across recall days (change SODI names if yours differ)
    kcal_mean   = rowMeans(across(c(DR1TKCAL, DR2TKCAL)), na.rm = TRUE),
    sodium_mean = rowMeans(across(c(DR1TSODI, DR2TSODI)), na.rm = TRUE)
  ) %>%
  group_by(sex2, age3) %>%
  mutate(
    r_diet   = percent_rank(diet_score),
    r_kcal   = percent_rank(kcal_mean),
    r_sodium = percent_rank(sodium_mean),

    r_sum        = r_diet + r_kcal + r_sodium,
    tern_diet    = r_diet   / r_sum,
    tern_kcal    = r_kcal   / r_sum,
    tern_sodium  = r_sodium / r_sum
  ) %>%
  ungroup()

# --- A1c: top 10% vs others ----------------------------------------
a1c_cut <- quantile(diet_ternary$a1c, 0.95, na.rm = TRUE)

a1c_comp <- diet_ternary %>%
  filter(!is.na(a1c)) %>%
  mutate(top_a1c = a1c >= a1c_cut) %>%
  group_by(top_a1c) %>%
  summarise(
    n          = n(),
    mean_diet  = mean(tern_diet,   na.rm = TRUE),
    mean_kcal  = mean(tern_kcal,   na.rm = TRUE),
    mean_sodium= mean(tern_sodium, na.rm = TRUE)
  )

a1c_comp
## # A tibble: 2 × 5
##   top_a1c     n mean_diet mean_kcal mean_sodium
##   <lgl>   <int>     <dbl>     <dbl>       <dbl>
## 1 FALSE    6372     0.360     0.319       0.321
## 2 TRUE      343     0.376     0.292       0.332
# --- SBP: top 10% vs others ----------------------------------------
sbp_cut <- quantile(diet_ternary$sbp, 0.95, na.rm = TRUE)

sbp_comp <- diet_ternary %>%
  filter(!is.na(sbp)) %>%
  mutate(top_sbp = sbp >= sbp_cut) %>%
  group_by(top_sbp) %>%
  summarise(
    n          = n(),
    mean_diet  = mean(tern_diet,   na.rm = TRUE),
    mean_kcal  = mean(tern_kcal,   na.rm = TRUE),
    mean_sodium= mean(tern_sodium, na.rm = TRUE)
  )

sbp_comp
## # A tibble: 2 × 5
##   top_sbp     n mean_diet mean_kcal mean_sodium
##   <lgl>   <int>     <dbl>     <dbl>       <dbl>
## 1 FALSE    7138     0.353     0.321       0.326
## 2 TRUE      380     0.403     0.305       0.292

The ternary summaries suggest subtle but real composition shifts in the high-risk tails. For A1c, the top 5% (n = 343) tilt slightly toward higher diet-score share (0.38 vs 0.36) and lower energy share (0.29 vs 0.32), with sodium nearly unchanged (0.33 vs 0.32). For SBP, the top 10% (n = 380) show a clearer move toward diet quality (0.40 vs 0.35) and away from sodium (0.29 vs 0.33), with a small drop in energy (0.31 vs 0.32). These shifts are modest in magnitude, but they hint that risk may concentrate in particular diet–energy–sodium mixes rather than in diet score alone—patterns that would need finer stratification and modeling to characterize more definitively.

Conclusion - Across tiles and ridges, income and age drive the steepest gradients in diagnosis and uncontrolled BP; diet tertiles mostly add small adjustments once we condition on these factors.
- Within BLI, PIR, and age strata, moving from low → high diet quality only modestly shifts A1c and SBP, and often tracks more with care exposure (access, blood testing, medication use) than with clearly lower underlying risk.
- Pathway plots suggest that better diet is associated with somewhat higher diabetes control, especially among older and higher-income adults on medication, but these gains are modest and unevenly distributed.
- Taken together, the project suggests that “diet quality” in this dataset is entangled with structural factors (income, age, access, treatment) rather than acting as a clean standalone lever, pointing future work toward disentangling behavior from access and testing policy- or clinic-level interventions, not just individual diet change.

Analytic limitations:

Despite these constraints, the combination of structured heatmaps, distributional ridges, and pathway plots still surfaces clear, interpretable patterns—showing how diet tends to co-vary with income, age, and behavior load in ways that shape diagnosis and control. As a scaffold, this analysis sets up a concrete, reproducible framework that future work can extend with proper survey weighting, richer diet metrics, and fully adjusted models.